HALO EFFECT IN REVERSE:
ARE TEACHERS’ RATINGS OF
HIGH-SCHOOL PUPILS VALID?


H. H. REMMERS AND ROBERT D. MARTIN 


Purdue University 


I. INTRODUCTION


This study was conducted to determine the ability of high- 
school teachers to differentiate between the maturity of high- 
school juniors and seniors. 

From most available information, it is rather evident that 
high-school juniors, as a whole, are less mature than high-school 
seniors. In order to determine the degree to which high-school 
teachers’ judgments would discriminate between juniors and 
seniors on relevant characteristics, an operational definition of 
maturity was prepared in the form of a rating scale. 


II. THE RATING SCALE AND ITS ADMINISTRATION


The instrument used in the present study was “The Purdue
Maturity Rating Scale”, constructed by H. H. Remmers. It is
a graphic scale containing thirteen items. Each item is arbitrarily
scaled from +10 through zero to -10, with descriptive
phrases at each end of the continuum to aid in rating. The rater
is asked to rate the student in comparison with the “average
high-school senior.”

The thirteen traits of the Scale are in order: 1. Personal 
appearance; 2. Physical maturity; 3. Ability to manage own 
financial affairs; 4. Study habits (budgeting time, sticking to 
job, planning work, etc.); 5. Health habits (food, clothing, rest, 
recreation, etc.); 6. Available energy; 7. Liability to homesick- 
ness; 8. Healthfulness of associations with opposite sex; 9. 
Ability to get along with others; 10. Desire to succeed, motivation;
11. Clearness of goals, definiteness of purpose; 12.
Ability to make sound decisions; and 13. Ability to look after 
self independent of adults. 

The graphic scale used for each of the traits is exemplified by 
trait 1, Personal appearance. 

1. Personal appearance

      +10                       0                      -10
       |------------------------|------------------------|
  Very attractive                               Very unattractive
  and pleasing                                  and unpleasant


The scale was administered at a district high school in Indiana. 
This school was chosen because of its size and because it is a 
rather typical public high school. It had at the time of the 
experiment a teaching staff of fourteen teachers and an enroll- 
ment of 389 in grades IX to XII inclusive. All of the juniors 
and seniors would, therefore, be fairly well-known by the mem- 
bers of the teaching staff. Fifty-six juniors and seniors were 
selected at random, by taking each nth student from an alpha- 
betical list. 

Each of the fourteen teachers rated each of the fifty-six 
selected students whom he felt he knew well enough to rate 
accurately. These randomly selected students received from 
three to eleven ratings each. After some imperfect rating scales 
had been discarded, there were twenty juniors and twenty seniors 
who had been rated by six or more teachers. These forty
students were chosen for the final experiment and six ratings on 
each were used, the extra scales being eliminated by a random 
process. Of the twenty juniors there were thirteen girls and 
seven boys while eleven girls and nine boys comprised the senior 
group. 

The scale was changed into a positive scale to facilitate the 
scoring and statistical computations. Ten units were added to
each rating so that the highest rating was twenty and the lowest
was zero.


III. THE RELIABILITY OF THE ITEMS


The reliability of each item was calculated by using the split- 
half method. The six ratings for each student were split into 
chance halves and each half was summed for each item for all of 
the students. The correlations between these chance halves
were obtained on each item by the Pearson product-moment 
method. The reliability of the six raters on each item was 
determined by using the Spearman-Brown prophecy formula. 
These reliabilities are shown in Table 1. 
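
The step from the half-scale correlation (three raters) to the reliability of all six raters can be sketched as follows. This is only a minimal illustration, assuming the general Spearman-Brown form with a lengthening factor of 2; the function name `spearman_brown` is hypothetical and is not taken from the study.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when the number of raters is multiplied by n."""
    return n * r / (1.0 + (n - 1.0) * r)


# Three-rater (chance-half) correlations for items 1 and 2, as printed in Table 1.
for item, r_half in [(1, 0.61), (2, 0.47)]:
    r_six = spearman_brown(r_half, 2)  # doubling three raters to six
    print(item, round(r_six, 2))
```

Under these assumptions the two items step up to .76 and .64, which agrees with the six-rater column of Table 1.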

These reliabilities of the ratings, considering the fact that they 
are from the rating of only six teachers and a total of forty 
students, were quite high. Only three items fell below .60. All 
are sufficiently high to yield stable averages for the group com- 
parisons presented later in this paper. 


IV. THE PREDICTION OF THE VALUES OF r BY THE 
SPEARMAN-BROWN FORMULA 


The ability of the Spearman-Brown formula to predict the 
increased reliability of ratings for an increased number of raters 
has been shown many times. Items 1, 2 and 11 were
chosen, somewhat at random, to test this hypothesis for the 
present study. 


TABLE 1. SUMMARY OF RELIABILITIES OF ITEMS

                                                 Reliability   Reliability
Item                                             (3 raters)    (6 raters)
 1. Personal appearance                              .61           .76
 2. Physical maturity                                .47           .64
 3. Ability to manage own financial affairs          .41           .58
 4. Study habits                                     .77           .87
 5. Health habits                                    .61           .76
 6. Available energy                                 .51           .67
 7. Liability to homesickness                        .32           .49
 8. Healthfulness of associations with
    opposite sex                                     .35           .52
 9. Ability to get along with others                 .53           .69
10. Desire to succeed, motivation                    .75           .86
11. Clearness of goals, definiteness of
    purpose                                          .79           .88
12. Ability to make sound decisions                  .71           .83
13. Ability to look after self independent
    of adults                                        .69           .81

The procedure was as follows: The correlations were obtained
for the fifteen possible pairs of the six raters. From
the mean of these correlations, the expected reliability of three 
raters was predicted by the Spearman-Brown formula: 


$$r_n = \frac{n\,r_1}{1 + (n - 1)\,r_1}$$


where $r_1$ is the reliability of one rater and $r_n$, with $n = 3$, the
predicted reliability of three raters.

The correlations, both obtained and predicted, were transformed
to “z” functions. The differences between the obtained
and the predicted z’s were found. The standard errors of the
differences were computed by the formula:


$\sigma_{\mathrm{diff}} =$ [expression not legible in the source]

The critical ratios between the differences and their standard 
errors were computed. This information is shown in Table 2. 
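
The comparison of obtained and predicted reliabilities can be retraced in a few lines. The sketch below assumes that the “z” function is Fisher's transformation (the inverse hyperbolic tangent), that z values are rounded to two places as in Table 2, and that the standard error of the difference is simply the .34 printed in Table 2 rather than a recomputed value. The names `fisher_z`, `critical_ratio` and `SE_DIFF` are illustrative only.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def critical_ratio(z_obtained: float, z_predicted: float, se_diff: float) -> float:
    """Difference between two z values divided by its standard error."""
    return (z_obtained - z_predicted) / se_diff


SE_DIFF = 0.34  # assumption: copied from Table 2, not derived here

# Item 1: obtained r = .61, predicted r = .60 (Table 2).
z_obt = round(fisher_z(0.61), 2)
z_pred = round(fisher_z(0.60), 2)
print(z_obt, z_pred, round(critical_ratio(z_obt, z_pred, SE_DIFF), 2))
```

With these inputs the sketch gives z values of .71 and .69 and a critical ratio of .06, matching the row for item 1.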


TABLE 2. COMPUTATION OF CRITICAL RATIOS FOR OBTAINED
AND PREDICTED r’s

        r (av.,
Item   1 rater)   r pred.   r obt.   z obt.   z pred.    D     σ_D    C.R.
 1.      .33        .60      .61      .71      .69      .02    .34    .06
 2.      .15      [remaining entries illegible in the source]
11.      .53        .77      .78     1.06     1.02      .04    .34    .12


It can be seen from the table that the differences between the 
obtained and predicted reliabilities are within the statistically 
allowable limits. It is concluded that, for these items, increased 
reliability for an increased number of judges can be predicted 
from the Spearman-Brown formula. 


V. THE INTERCORRELATIONS OF THE ITEMS 


A total score for each student was obtained on each item by 
summing the ratings of the six teachers. The correlations 
between the items were computed by the Pearson product- 
moment method. These correlations are shown in Table 3. 
The correlations when corrected for attenuation are shown in
Table 4.

It can be seen from these tables that many of these correlations
are substantially one, or so high as to indicate that the items were
measuring the same thing. The intercorrelations between items
on rating scales have been given several different names, such as
halo effect and logical error. According to Symonds, these
correlations are highest when the items rated are: (1) not easily 
observable, (2) not frequently singled out or discussed, (3) not 
clearly defined, and (4) of high moral importance. Most of 
these items fall into one or more of these categories. 
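
The correction for attenuation behind Table 4 can be illustrated briefly. The sketch assumes the usual form of the correction, in which an observed correlation is divided by the square root of the product of the two reliabilities, and it uses the six-rater reliabilities of Table 1; the function name is hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Observed correlation divided by the geometric mean of the two reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Items 10 and 11: observed r = .99 (Table 3), reliabilities .86 and .88 (Table 1).
print(round(correct_for_attenuation(0.99, 0.86, 0.88), 2))
```

The result, 1.14, agrees with the entry for items 10 and 11 in Table 4. Values above unity arise whenever the observed correlation exceeds the geometric mean of the two reliabilities, which is why so many corrected entries pass 1.00.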


TABLE 3. CORRELATIONS BETWEEN ITEMS (UNCORRECTED)

Item
        2     3     4     5     6     7     8     9    10    11    12    13
 1.   .68   .72   .55   .74   .61   .38   .71   .64   .56   .58   .44   .60
 2.         .47   .26   .29   .77   .53   .45   .60   .31   .30   .20   .46
 3.               .77   .86   .73   .41   .68   .73   .82   .82   .80   .86
 4.                     .89   .57   .70   .60   .59   .94   .96   .91   .82
 5.                           .76   .36   .80   .75   .89   .91   .81   .88
 6.                                 .59   .62   .77   .65   .64   .59   .75
 7.                                       .37   .46   .33   .20   .31   .54
 8.                                             .75   .64   .65   .56   .70
 9.                                                   .72   .70   .65   .78
10.                                                         .99   .92   .89
11.                                                               .93   .87
12.                                                                     .87
13.

TABLE 4. CORRELATIONS BETWEEN ITEMS (CORRECTED FOR
ATTENUATION)

Item
        2     3     4     5     6     7     8     9    10    11    12    13
 1.   .98  1.09   .67   .97   .86   .63  1.14   .88   .69   .72   .56   .76
 2.         .77   .34   .41  1.17   .95   .79   .91   .42   .40   .28   .64
 3.              1.08  1.29  1.16   .77  1.24  1.15  1.16  1.14  1.15  1.26
 4.                    1.09   .75  1.07   .90   .75  1.08  1.09  1.03   .98
 5.                          1.06   .59  1.28  1.04  1.10  1.11  1.02  1.12
 6.                                1.04  1.06  1.12   .85   .83   .79  1.01
 7.                                       .74   .78   .56   .30   .48   .85
 8.                                            1.25   .96   .96   .86  1.08
 9.                                                   .93   .90   .86  1.03
10.                                                        1.14  1.09  1.06
11.                                                              1.09  1.03
12.                                                                    1.06
Apparently many of the items were measuring the same thing,
whatever it was. From a close examination of Table 4 it may
be seen that there are two partially distinct groups of items. 
These are items 2, 6, and 7, and items 3, 4, 5, 8, 9, 10, 11, 12 and 
13. The correlations between the members of the former group 
are all above .95, while all but one of the correlations between 
members of the latter group are above .86. These groups appear 
to be somewhat distinct, since lower correlations were found 
between the members of the two groups. A factor analysis 
would be needed to show clearly that these were really two dis- 
tinct groups. Because of the very limited sample used in the 
present study the carrying out of a factor analysis was not 
judged worth the labor involved. 


VI. THE DIFFERENCES BETWEEN JUNIORS AND SENIORS 


The differences between juniors and seniors and the standard
error of the differences were computed for each item. These
differences and the probability that they could have arisen by
chance are shown in Table 5.


TABLE 5. SUMMARY OF DIFFERENCES BETWEEN JUNIORS
AND SENIORS

Item                                             Difference*   D/σ_D     P
 1. Personal appearance                             -6.35       1.53    .07
 2. Physical maturity                                 .90        .29    .39
 3. Ability to manage own financial affairs         -4.20       1.04    .16
 4. Study habits                                   -11.25       1.62    .06
 5. Health habits                                   -9.95       2.18    .02
 6. Available energy                                -3.85        .93    .18
 7. Liability to homesickness                        1.20        .30    .38
 8. Healthfulness of associations with
    opposite sex                                   -12.75       2.92    .004
 9. Ability to get along with others                -9.05       2.15    .02
10. Desire to succeed, motivation                  -10.80       1.84    .04
11. Clearness of goals, definiteness of
    purpose                                        -11.40       1.77    .05
12. Ability to make sound decisions                 -9.85       1.60    .06
13. Ability to look after self independent
    of adults                                     [entries illegible in the source]

* Positive difference favors seniors, negative difference favors juniors.

It can be seen from the table that the teachers favored the 
juniors on eleven of the thirteen items. Although only five 
of these differences are significant at the five-per-cent level, all 
of them show a substantial difference, greater than the difference 
on the two items favoring the seniors. The probability of the 
juniors being favored on eleven of the items, had there been no 
difference whatsoever on any of the items, would have been 
less than .02, since that is the probability of eleven of thirteen 
equal chance occurrences falling in a particular direction. 
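
The figure of “less than .02” can be checked with a short binomial computation. The sketch below assumes thirteen independent items, each equally likely to favor either class; that independence is an assumption of the illustration only, and the high intercorrelations reported above suggest it holds at best roughly. The function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Eleven or more of thirteen items falling in one stated direction.
print(round(prob_at_least(11, 13), 4))
```

The probability of eleven or more items falling in a stated direction is 92/8192, or about .011, which is consistent with the bound quoted in the text.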

Item 1, “personal appearance”, is not a direct measure of
maturity, so it is, perhaps, not so surprising that the juniors 
were favored. 

Item 2, “physical maturity”, is probably the least debatable
measure of maturity. One would expect the seniors to show a 
substantial advantage on this item. The actual difference, 
while favoring the seniors, did so by the smallest margin of any 
of the items. 

Item 3, “ability to manage own financial affairs”, a subtrait
of maturity, which would be expected to favor the seniors, 
actually favored the juniors by a substantial margin. 

High-school seniors usually study less than juniors, since they 
are affected by the usual rush of senior activities. The probable 
effect of this situation on the judgments of the teachers appears 
in the junior advantage on item 4, “study habits”, item 10,
“desire to succeed, motivation”, and item 11, “clearness of
goals, definiteness of purpose”.

Item 5, “health habits”, item 9, “ability to get along with
others”, item 12, “ability to make sound decisions”, and item 13,
“ability to look after self independent of adults”, are all measures
of subtraits of maturity which would be expected to favor the 
seniors. All of them, however, favored the juniors, items 5 and 9 
being significant at the two-per-cent level. 

It is interesting to note that the item of high moral importance, 
“healthfulness of associations with opposite sex”, showed the 
most significant difference in favor of the juniors. 

Item 6, “available energy”, could have been expected to
have favored either group. Item 7, “liability to homesickness”,
the least reliable of the items, showed no substantial difference. 
It is concluded from the differences shown that teachers,
although agreeing among themselves, are unable to rate students 
validly on their maturity as here operationally defined. The 
items which other evidence would indicate should have favored 
the seniors, did not do so. It is, therefore, a reasonable inference 
that in a very real sense these ratings, whatever they may tell 
about the relative maturity of high-school juniors and seniors, 
reveal important characteristics of the high-school faculty who 
did the ratings. Their mores as a frame of reference for their 
ratings apparently are on the whole less offended by juniors than 
by seniors. The hypothesis that this is a social-psychological 
phenomenon general to high schools needs to be further investi- 
gated. If the ‘halo effect’ in reverse here observed is general, it 
calls into question the widely practiced rating of high-school 
pupils. 


VII. SUMMARY AND CONCLUSIONS


This study was conducted to determine the ability of high- 
school teachers to differentiate between the maturity of high- 
school juniors and seniors as measured by the Purdue Maturity 
Rating Scale. Comparisons were made between the ratings of 
twenty juniors and twenty seniors each of whom was rated by 
six teachers. 

With due regard to their limited extent, the data appear to warrant
the following generalizations.

1) For the instrument used and the population tested, teachers
agree very well in their ratings.

2) The Spearman-Brown formula is applicable in predicting 
the reliability for an increased number of raters. 

3) Juniors are in general rated higher than seniors. Five of 
the items favoring the juniors are significant at least at the five- 
per-cent level. On several characteristics, which, according to 
other studies, should favor the seniors by a considerable margin, 
the juniors rate better. 

4) If we make the reasonable assumption that maturity is a
function of age, teachers do not rate juniors and seniors validly.
The juniors, presumably less mature than the seniors by other
standards, are rated as on the whole more mature by their
teachers.








THE RELATION OF INDIVIDUAL VARIABILITY 
TO INTELLIGENCE* 


SUSAN W. GRAY 
Florida State College for Women 


The purpose of the present investigation has been to discover 
what relationship, if any, exists between individual variability 
in educational achievement and intelligence. More specifically,
the problem has been one of discovering whether groups selected
as being of high, average, and low intelligence will show differ- 
ences in the amount of individual variability evinced by their 
members. 

Individual variability, as the term is employed in this investi- 
gation, is used to refer to the extent to which an individual differs 
in his several abilities as revealed by a series of measures. At 
present the body of literature upon the subject of such variability 
is extremely limited. What few investigations have been made 
of individual variability have been concerned almost exclusively 
with its extent. Studies which may be cited in this connection 
are those of Hull, Chen, and Stout. These studies have
concurred in finding individual variability to exist to a marked 
degree in the populations studied. To the writer’s knowledge, 
however, there has been only one published study which has 
dealt with the relation of individual variability to intelligence, 
except studies which have been concerned with variability on 
subtests as it is related to total score upon an intelligence test. 
This one study is that of De Voss, published in the Genetic
Studies of Genius. In comparing the individual variability in 
attainment upon the Stanford Achievement Test of one hundred 
of Terman’s gifted group and ninety-six control children, De Voss 
found only slight and insignificant differences in the individual 
variability of the two groups. 

In view of the demonstrated existence of individual variability
and in view of the absence of material upon the relation of individual
variability to intelligence, the present study was planned
with the hope that it might render possible some conclusion as
to the relation of such variability to intelligence among children
as they are actually found in classroom situations.

* The data upon which this study is based are contained in a thesis presented
by the writer in partial fulfillment of the requirements for the degree
of Doctor of Philosophy, George Peabody College for Teachers, 1941. The
writer wishes to acknowledge the valuable suggestions and criticisms of
Dr. Paul L. Boynton in the preparation of the study.


METHOD OF STUDY 


The data used in this investigation have been derived from 
records of the test scores of six hundred children upon the 
Kuhlmann-Anderson intelligence test and upon the Unit Scales 
of Attainment. These children were selected from a group of 
seven thousand sixth-grade pupils for whom scores upon the two 
tests mentioned had been obtained in connection with a project 
of the Coördinated Studies in Education. Twenty-nine states
were represented among the cases chosen, the largest number 
being from the Midwest and the Northwest. 

These cases were selected in accordance with certain percentile 
points determined for a normative group of twelve hundred 
children from the sixth-grade records collected by the Coördinated
Studies in Education. In this way were chosen for purposes of 
comparison three groups of two hundred children each, one
group of superior intelligence, one of average, and one of inferior
intelligence. For the group of high intelligence, one hundred
boys and one hundred girls were selected with scores above the 
85th percentile point of this normative group. The mean intel- 
ligence quotients of the groups so selected were 121 for boys and 
122 for girls. The group of average intelligence was chosen 
from among those whose scores lay between the 40th and the 60th 
percentile points. The mean intelligence quotient of the one 
hundred boys was 102, and of the one hundred girls, 104. The 
low group, selected from among those whose intelligence quo- 
tients lay below the 15th percentile point, had mean intelligence 
quotients of 81 and 83, for boys and girls, respectively. It may 
be seen, then, that there was a difference of approximately
20 points in IQ between means of adjacent groups. Furthermore,
any individual in any group is separated from any individual in
an adjacent group by at least twenty-five per cent of the norma- 
tive population. 

From certain preliminary observations it appeared that indi- 
vidual variability among the cases studied was in part a function 
of the particular classroom group in a given school to which a 
child belonged. For this reason, cases were so selected that an
equal number of children of high, average, and low intelligence
were chosen from each class group. It was hoped that such a 
procedure would tend to render equivalent the influence of 
local environmental conditions upon the three groups. 

No attempt was made to equate the subjects for chronological 
age, mental age, economic status, or various factors of personality 
adjustment. The justification of this method of selecting cases 
appears to the writer to lie in the fact that equating children for 
any one of these variables would tend to render groups atypical 
with respect to usual classroom situations. For example, one 
would expect to find a sixth-grade child with an intelligence 
quotient of 120 to be considerably younger than a sixth-grade 
child with an intelligence of 80 except in rather unusual circum- 
stances. Obviously, equating the children for chronological age 
would render groups highly atypical insofar as the classroom 
situation is concerned. The same is probably true to a certain 
extent of other variables for which the three groups might have 
been equated. An attempt was made, then, to select children 
in as random a manner as possible, controlling only the number
of children of high, average, and low intelligence selected from 
each classroom group. 

The intelligence quotients which formed the basis of selection
of cases were derived from scores upon the Kuhlmann-Anderson 
Tests. The test scores of educational achievement were obtained 
from six subtests of the Unit Scales of Attainment. These 
six were: reading comprehension, geography, elementary science, 
arithmetic fundamentals, spelling, and English usage. These 
six subtests were selected arbitrarily. It was felt, however, that 
they were fairly representative of the fields of subject-matter 
customarily taught in the sixth grade, and that they were fields 
which would not show a high degree of overlapping. 

In obtaining an individual variability score for each subject 
upon the basis of six subtests of the Unit Scales of Attainment
the following method was used: 

The individual’s scores upon the six subtests of the Unit 
Scales were first converted into standard scores. These standard 
scores were based upon the deviation in sigma-units of each 
individual from the mean of his own classroom group. The 
writer believed that in view of the many purely local factors 
which may affect an individual’s variability, a measure of 
variability based upon deviations from an individual’s own
class group would be more significant than a measure based 
upon deviations from some normative group. 

After standard scores were obtained from each case upon the 
six tests of educational attainment, an individual variability 
score was then calculated for each child by finding the sum of all 
possible differences between standard scores of the child. Since 
six tests were employed, there were fifteen possible differences 
between scores. This sum of all possible differences was found 
in accordance with a formula suggested by Hertzman. Because
of the small number of scores to be considered for each indi- 
vidual, the sum of all possible differences with respect to the six 
scores was considered by the writer as probably a more satis- 
factory index of variability in this situation than the standard 
deviation employed by Hull or the interquartile range used by 
Hertzman in studies of individual variability.
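
The variability score just described can be sketched by direct enumeration. The source does not reproduce Hertzman's formula, so the code below is not that formula; it assumes only that each of the fifteen differences is taken as an absolute value and summed. The example standard scores are invented for illustration, and the function name is hypothetical.

```python
from itertools import combinations


def sum_of_all_differences(scores):
    """Sum of absolute differences over every pair of scores."""
    return sum(abs(a - b) for a, b in combinations(scores, 2))


# Hypothetical standard scores of one child on the six subtests.
example = [0.5, -0.2, 1.1, 0.0, -0.7, 0.3]
n_pairs = len(list(combinations(example, 2)))
print(n_pairs, round(sum_of_all_differences(example), 2))
```

For six scores the enumeration runs over fifteen pairs, as stated in the text; the invented scores above yield a sum of 11.4.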

By obtaining measures of central tendency and dispersion 
of the sums of all possible differences for the groups of high, 
average, and low intelligence, various comparisons were made 
possible. It is largely in terms of such intergroup comparisons 
that the results of the present investigation have been expressed. 
It will be the purpose of the remainder of the present paper to 
discuss these comparisons and their possible significance. 


THE RESULTS 


The various intergroup comparisons will be discussed in the
following order: 

1) Comparisons of the means. 

2) Comparisons of the 90th, 75th, 50th, 25th, and 10th 
percentile points. 

3) Comparisons of the standard deviations. 

4) Comparisons of the means, percentile points, and standard 
deviations of the two sexes. 

The scores which formed the basis of these analyses are given 
in Table 1, which presents the parameters of the distributions 
as well as certain percentile points. 

TABLE 1. A SUMMARY PRESENTATION OF THE INDIVIDUAL
VARIABILITY SCORES OF THE THREE GROUPS

Measure   Sex     High Group   Middle Group   Low Group

Mean      Boys      12.72         12.36         15.36
          Girls     12.20         12.54         13.52
          Both      12.46         12.45         14.44

Sigma     Boys       5.00          4.21          6.46
          Girls      4.80          4.92          5.36
          Both       4.90          4.60          6.00

P90       Boys      19.09         18.29         23.67
          Girls     18.75         19.00         21.00
          Both      18.95         18.62         22.55

P75       Boys      15.83         15.14         19.00
          Girls     16.00         14.92         16.62
          Both      15.90         15.04         17.40

P50       Boys      11.81         11.89         13.78
          Girls     11.69         12.00         13.16
          Both      11.76         11.94         13.46

P25       Boys       9.20          9.18         10.80
          Girls      8.25          9.18          9.67
          Both       8.71          9.18         10.30

P10       Boys       7.00          7.09          8.50
          Girls      6.47          7.00          6.75
          Both       6.69          7.05          7.69


When the means of the three groups were compared with
one another, differences between the groups of high and average
intelligence were small and fell far short of statistical significance,
there being only from about fifty-one to seventy-one chances in
one hundred that the true differences between high and middle
groups were greater than zero. When comparisons were made
between high and low, and middle and low groups, however,
significant trends pointing to greater variability on the part of
the low group were observed. When the groups of boys were
compared, and the groups of boys and girls combined, differences
between the low groups and the two higher groups were statistically
significant. The same trend was shown when the three
groups of girls were compared. While in the case of the girls
these differences were not large enough to reach the conventionally
accepted criterion of statistical significance, they did indicate
from about ninety-one to ninety-nine chances in one hundred
of a true difference greater than zero.
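
The “chances in one hundred” quoted for the means can be approximated from the entries of Table 1. The writer does not state the formula used, so the sketch below is an assumption: it treats the groups as independent, takes the standard error of a difference as the root of the summed squared standard errors, and reads the chances from the normal curve. Group sizes of one hundred per sex, and two hundred for both sexes combined, follow the method section; the function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative probability of the unit normal curve."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def chances_in_100(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Critical ratio of a difference between means, and chances in 100
    that the true difference is greater than zero."""
    se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    ratio = abs(mean_a - mean_b) / se_diff
    return ratio, 100.0 * normal_cdf(ratio)


# High versus middle group: boys, then both sexes combined.
boys = chances_in_100(12.72, 5.00, 100, 12.36, 4.21, 100)
both = chances_in_100(12.46, 4.90, 200, 12.45, 4.60, 200)
print(round(boys[1]), round(both[1]))
```

On these assumptions the high and middle groups give about seventy-one chances in one hundred for the boys and about fifty-one for both sexes combined, in line with the range reported in the text.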

Somewhat comparable results were obtained when the 90th, 
75th, 50th, 25th, and 10th percentile points of the three groups 
were compared. Again, differences between high and middle 
groups were slight and inconsistent, seven of fifteen possible 
comparisons indicating greater variability in the high group and 
eight in the middle group. Of the thirty possible compari- 
sons, however, between the low group and the other two groups, 
in only one instance did the lower group show less individual 
variability. Although only six of these differences were large 
enough to be considered statistically significant, their high degree 
of consistency would seem to indicate a difference of some 
importance. 

These differences found with respect to the various groups 
would not appear to be explicable in terms of test construction 
alone. If the tests were so constructed that they failed to dis- 
criminate adequately at either the upper or lower levels of 
achievement, one would expect individuals whose achievement 
in the various subjects was closest to the means of the group to 
show more individual variability than those whose scores tended 
to center in either the very high or very low levels. It was 
characteristically the group of average intelligence whose scores 
tended to center around the means of their respective groups.
Yet the group of average intelligence showed an average indi- 
vidual variability which was less than that of the low group and 
approximately equal to that of the high group, a finding which 
would seem to negate the hypothesis that the differences in the 
various groups are explicable in terms of test construction alone. 

The individual variability scores of the three groups were 
also compared with respect to the difference in the degree of 
dispersion of the sums of all possible differences shown by the 
three groups. When the standard deviations of the sums of all 
possible differences of the three groups were compared, a tendency 
was observed for slightly greater dispersion to appear in the low 
group than in either of the two higher groups. This trend was 
statistically significant when the individual variability scores 
of the boys were compared and when those of both boys and 
girls combined were compared, as indicated by standard errors.
These differences were 4.10 and 3.69 times their standard errors.
The differences with the groups of girls, however, were slight. It
appears, then, that not only was greater average individual
variability shown by the low group, but that where boys and
both boys and girls combined were concerned, the low group
also revealed less homogeneity, a wider scattering of
scores being shown than with either of the two upper groups.

A final comparison of the sums of all possible differences was 
made in studying possible sex differences in the degree of indi- 
vidual variability evinced by the subject population. When 
the means of the boys and girls were compared, differences 
between boys and girls of the high and middle groups were 
negligible. Somewhat greater individual variability was shown 
upon the part of the boys in the low group. A difference here 
occurred which indicated about ninety-eight chances in one 
hundred of a true difference greater than zero between the two 
groups. When the boys and girls were compared at each of 
the five percentile points studied in the investigation, in eleven 
of the fifteen possible comparisons the boys showed somewhat 
greater individual variability. The differences, however, were 
small and none was greater than 1.68 times its standard error. 
It would appear, then, that while some slight indication is given
of greater individual variability upon the part of the boys this 
relationship, insofar as the present investigation is concerned, is 
questionable. 

The present analysis has not been concerned primarily with 
discovering the extent of individual variability. In view of the 
paucity of data upon the subject, however, such findings con- 
cerning its extent as emerged in the course of this investigation 
might prove of some worth. 

In Table 2 are given the average mean differences with respect 
to the six subtests for the several groups studied. These mean 
differences were found by dividing the sum of all possible differ- 
ences in standard scores for the groups by 15, the number of 
comparisons possible where six scores are concerned. These 
average mean differences are seen to range from 0.810 to 1.024. 
With the assumption that the average individual varies around 
the mean of his class group, the mean difference is found to be 
one which covers from approximately thirty-one to thirty-eight 
per cent of the area of a normal curve and hence of the range to
be expected if the group follows a normal distribution. To be 
sure, such an assumption is hardly tenable, since few were the 
individuals who varied exactly as far above as below the means 
of their groups. It does give some indication, nevertheless, of 
the extent of individual variability which may be expected on 
the average. 


TABLE 2. AVERAGE MEAN DIFFERENCES WITH RESPECT TO
THE SIGMA-SCORES OF THE THREE GROUPS

Groups                    Boys     Girls    Both
High Intelligence         0.848    0.813    0.830
Average Intelligence      0.825    0.836    0.830
Low Intelligence          1.024    0.902    0.964


Table 3 presents similar data, except that in this case the 
average range of sigma-scores is considered rather than the 
average mean difference. These average ranges may be seen 
to vary from 1.85 to 2.31. Assuming again that the average 
individual varies about the mean of his class group, one would 
find that this average individual has a range in sigma-scores 
covering from about sixty-four to seventy-five per cent of the 
normally anticipated range. 
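
One way to read these percentages is as the share of a unit normal curve lying within an interval of the stated width centred on the group mean. The sketch below works under that assumption, which is not spelled out in the text; the function name is hypothetical. It comes close to the figures quoted for both the mean differences and the ranges.

```python
from math import erf, sqrt


def central_area(width):
    """Share of a unit normal curve inside an interval `width` sigma wide, centred on the mean."""
    half_width = width / 2.0
    return erf(half_width / sqrt(2.0))


# Extremes of the average mean differences (Table 2) and of the average ranges (Table 3).
for width in (0.810, 1.024, 1.85, 2.31):
    print(width, round(100.0 * central_area(width), 1))
```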

Statistically significant differences may be found in the
comparisons of these means, differences on the whole comparable
to those found when the sums of all possible differences for the
various groups are compared. Probably more striking than
the rather slight differences which this study has uncovered,
however, is the marked degree of similarity found in the average
extent of variability in the three groups selected according to
intelligence and in the two sex groups. It would appear that a
considerable amount of individual variability may be anticipated
in educational achievement as it has been studied in this
investigation, irrespective of either the sex or the intelligence
level of the individual child.


TABLE 3.—MEANS AND STANDARD DEVIATIONS OF THE RANGE
IN SIGMA-SCORES OF THE THREE GROUPS


                               Boys          Girls          Both
Groups                      Mean   SD     Mean   SD     Mean   SD
High Intelligence.........  2.13  0.85    1.86  0.71    1.91  0.78
Average Intelligence......  1.85  0.68    1.91  0.81    1.89  0.75
Low Intelligence..........  2.31  0.99    2.09  0.82    2.19  0.90

To summarize briefly, then, within the limits of the tests used, 
the method of selecting subjects, and the particular statistical 
analyses employed, the following statements concerning indi- 
vidual variability would seem to be justified. 

1) Individual variability in educational achievement is some- 
what greater in the group of low intelligence than in either of the 
two other groups when averages are considered and when the 
90th, 75th, 50th, 25th, and 10th percentile points are studied. 

2) Boys of low intelligence show slightly less homogeneity 
with respect to the size of individual variability scores than do 
the other groups studied. Among these other groups little 
difference was found. 

3) Only negligible differences were found with respect to sex 
either in the average amount of individual variability or in the 
degree of dispersion of scores within a sex group. 

4) Irrespective of intelligence level or of sex, a wide extent of 
individual variability is revealed, the average mean difference 
being close to one sigma-unit, and the range of sigma-scores 
for the individual being on the average approximately two 
sigma-units. 

The findings of this study are largely negative in nature. No 
marked or consistent relationships have been discovered with 
respect to intelligence and individual variability in educational 
achievement. In view of the demonstrated existence of indi- 
vidual variability, such findings would suggest the advisability 
of investigating other factors than intelligence in the individual’s 
development which might show a relationship with individual 
variability. 

On the positive side, probably the most significant finding of 
the present investigation is that of the considerable extent of 
individual variability to be anticipated upon the part of a subject 
population such as that studied in the present analysis. If it is 
true that one might expect the average child in a group to vary 
over the average range found in this study, this hypothetical 
child would vary from a point at which he surpasses less than 
sixteen per cent of his class in one performance to a point at
which he surpasses more than eighty-four per cent in another 
performance. Pertinent also is the fact that such variability 
may be anticipated irrespective of either the intelligence level 
or the sex of the child. Certainly no support is offered for the 
fatalistic view that a child who is a failure in one phase of his 
school work is a failure in all phases, or that, conversely, the 
child who excels in one school subject may be expected to excel 
in all of them. Rather, evidence is advanced that, insofar as 
the factors considered in the present investigation are concerned, 
every school child must be considered as a unique individual with 
unique capacities and potentialities of achievement. 
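
The sixteen and eighty-four per cent figures follow from a range of about two sigma-units centred on the class mean, that is, from one sigma below to one sigma above. A minimal check, assuming a normal distribution of class scores (the helper name is hypothetical):

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1


def percent_surpassed(sigma_score):
    """Per cent of a normally distributed class falling below the given sigma-score."""
    return 100.0 * unit_normal.cdf(sigma_score)


low = percent_surpassed(-1.0)   # weakest performance of the hypothetical child
high = percent_surpassed(1.0)   # strongest performance of the same child
print(round(low, 1), round(high, 1))
```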


ATTITUDES TOWARD WAR AS EXPRESSED BY
AMISH AND NON-AMISH CHILDREN 


T. L. ENGLE 
Indiana University Extension Center, Fort Wayne, Indiana 


It is to be hoped that the present opportunity for studying 
children’s attitudes toward war in time of war is one which 
psychologists will not have again for many years. Already there 
is a considerable body of literature pertaining to children in their 
relationships to war. It is imperative that present attitudes be
studied not only for their immediate value but so that they may
serve as a basis for comparison with studies to be made in time
of peace.


PURPOSE OF THE RESEARCH 


In a previous study the writer has found some evidence of 
general attitude differences between Amish and non-Amish 
children. Members of the Amish sect are by faith conscientious
objectors to war, the objection being made not only in times of 
war but also in times of peace. The purpose of the present 
study has been to compare the attitudes toward war of Amish 
children, with their background of religious teaching on the 
subject of war, with the attitudes of non-Amish children who do 
not necessarily have this background of religious teaching and 
who are open to the bombardment of war propaganda through 
such mediums as the radio and movies. The non-Amish children 
of the present study are from the same general socio-economic 
level as the Amish children. Attitudes have been measured by
two techniques: free expression in a written theme and a
standardized attitude-toward-war scale. The data were
obtained in the fall of 1943. 


SUBJECTS 


The subjects were pupils in the seventh and eighth grades of 
fourteen schools in northeastern Indiana. Of the total of two 
hundred ninety-four children, one hundred thirty-four belonged 
to the Amish sect and one hundred sixty were non-Amish, the 
sex distribution being as indicated in Table I. The entire popu- 
lations of the seventh and eighth grades were used in all schools 
except one. In that school the population of an already formed
subgroup was used. 

For those not familiar with the Amish sect it should be said 
that the members wear a distinctive garb, do not have radios in 
their homes, do not own automobiles, do not attend moving 
picture theaters, and in other ways lead what they call an humble 
life. Practically all Amish people are farmers and the fourteen 
schools in the present study were either rural or small-town 
schools. Most of the non-Amish children lived on farms, also. 

Of the one hundred thirty-four Amish children, six boys and 
seven girls belonged to a branch of the sect which does not wear 
the garb. However, the members of this branch are also 
conscientious objectors to war and so their children have been 
included in the Amish groups. 

Seven boys and ten girls came from three schools in which 
there were only Amish children in the seventh and eighth grades, 
the remaining one hundred seventeen Amish children attending 
schools in which there were both Amish and non-Amish children 
in these grades. 


TECHNIQUES OF THE RESEARCH 


In order to secure a free expression of attitudes toward war 
the subjects were asked to write a theme on the subject, “How
the war affects me.” This theme subject was suggested by and
is the same as that used by Mandel Sherman in his study of the 
attitudes toward war of Chicago children of high-school age.* 

In the present study the themes were written during the regular 
English (or other) period and under the direction of the regular 
classroom teacher. The children were not told the subject of 
the theme in advance. They were encouraged to begin writing 
promptly and to write fully on the subject. In making the 
assignment teachers emphasized the fact that each child had a 
right to express whatever ideas he had about how the war 
affected him and that his theme would not be criticised because 
his ideas might not happen to agree with those of the teacher. 

Teachers were given permission to read the themes as part of 
their regular instructional work if they cared to do so. However,
they were not permitted to place marks, corrections, or personal 
comments on the themes because of the possible influence which 
such marks, corrections, or comments might have on those who 
were to rate the themes for attitudes toward war. Lists
indicating which children were Amish and which were non-Amish
were sent to the writer, but teachers made no notations on the 
themes to indicate the religious affiliations of the children because 
of the possible influence which such notations might have on 
those who were to rate the themes. It should be said, however, 
that in some cases the religious affiliation of the child could be 
determined from the content of the theme. 

The themes were read and rated by five graduate university 
students. Four of the students were experienced public school 
teachers, the other was an editor of religious publications and 
had taught in a small college. All were taking a course in the 
Psychology of Personality at the time they rated the themes. 
As part of the work in this course the measurement of attitudes 
had been discussed, including reference to and discussion of the 
measurement involved in the present study. 

A seven-point rating scale was prepared ranging from direct 
antagonism to war with a weight of one to a strongly favorable 
attitude toward war with a weight of seven. Each judge 
worked entirely independently of the others and had no access to 
the ratings of the others. It was somewhat difficult to rate some 
of the themes because of the limited expression of ideas, but each 
theme was rated to the best of each judge’s ability. For seventy- 
six of the two hundred ninety-four themes (25.9 per cent) the 
five judges agreed perfectly in their judgments; for ninety-eight 
additional themes (33.3 per cent) the judges did not disagree by 
more than one point in their judgments. Combining these two 
figures we have the fact that for one hundred seventy-four of the 
two hundred ninety-four themes (59.2 per cent) the five judges 
were either in perfect agreement or did not disagree by more than 
one point on a seven-point rating scale. 
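
The agreement figures can be reproduced with a short sketch. It assumes that "did not disagree by more than one point" means the highest and lowest of the five ratings differ by exactly one point; the function names and the sample ratings are hypothetical.

```python
def agreement_class(ratings):
    """Classify one theme by the spread between its highest and lowest rating."""
    spread = max(ratings) - min(ratings)
    if spread == 0:
        return "perfect"
    if spread == 1:
        return "within one point"
    return "wider"


def per_cent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / total, 1)


print(agreement_class([2, 2, 2, 2, 2]))   # hypothetical ratings from five judges
print(agreement_class([2, 3, 2, 2, 3]))
print(per_cent(76, 294), per_cent(98, 294), per_cent(76 + 98, 294))
```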

An additional measurement of attitudes toward war was 
secured by asking the children to check statements on the Ruth 
C. Peterson Attitude Toward War, Scale No. 34, Form A, Uni- 
versity of Chicago Press (Edited by L. L. Thurstone). This scale 
was not developed for use with children, and teachers were asked 
to indicate whether or not children experienced difficulty with 
the wording of the scale. Three of the fourteen teachers reported 
some such difficulty. In cases where individuals did not know 
the meaning of a word the teacher was permitted to read the 
child a definition from a dictionary but was cautioned to take
care not to influence the child’s answer in any way. The 
attitude scale was checked by the children after they had written 
and handed in their themes. 

Teachers were instructed to impress upon the children the fact 
that the scale was not an examination in any sense, that there 
were no right and wrong answers, and that school marks would 
in no way be based on their answers. Teachers were permitted 
to tell the children that a teacher in a university was interested 
in how seventh- and eighth-grade boys and girls felt about war 
and that their papers would be sent to him. The children were 
warned not to change the wording of any statements on the scale. 
They were urged to check the statements promptly and to hand 
in the papers as soon as they had completed the checking. 

The scale consists of twenty statements. The subject is 
instructed to put a check mark in front of each statement with 
which he agrees, to put a cross in front of each statement with 
which he disagrees, and to place a question mark in front of a 
statement in case he cannot decide. The scales were scored 
according to the directions published by the author of the scale. 
The scoring is so done that the lower the score the more unfavor- 
able toward war is the attitude, whereas the higher the score the 
more favorable toward war is the attitude. 


ATTITUDES AS MEASURED BY THEMES 


The themes contained a great deal of material of interest to 
the psychologist. An analysis of them will be presented in a 
subsequent paper. 

As has been indicated, five judges rated each theme for attitude 
toward war. The mean of these ratings was taken as the score 
on each theme. Comparative data for Amish and non-Amish 
children are indicated in Table I. In reading this table it should 
be remembered that a rating of four indicated a neutral position 
with expression favorable toward war being balanced by expres- 
sion unfavorable toward war. It is of interest to note that non- 
Amish boys were more favorable toward war than non-Amish 
girls and that the same was true for Amish boys and girls. 
However, these differences are not statistically significant, the 
significance ratios being 1.00 and .95, respectively. The
differences between Amish and non-Amish children are statistically
significant. 
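
The significance ratio used here is the difference between two means divided by the standard error of that difference. The sketch below assumes the conventional formulas (standard error of a mean as the sample spread divided by the square root of the number of subjects, and the standard error of a difference as the root of the summed squares), and it treats the "SE Sample" column of Table I as a standard deviation. Neither assumption is stated in the text, and the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_mean(sd, n):
    """Standard error of a mean from the sample spread and the number of subjects."""
    return sd / sqrt(n)


def significance_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two means divided by the standard error of the difference."""
    se_a = standard_error_of_mean(sd_a, n_a)
    se_b = standard_error_of_mean(sd_b, n_b)
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) / se_diff


# "Both" rows of Table I: all Amish children against all non-Amish children.
ratio = significance_ratio(2.70, 1.09, 134, 3.98, 1.64, 160)
print(round(ratio, 2))
```

Under these assumptions the result lands close to the ratio of 7.97 reported for the combined groups.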


TABLE I.—DISTRIBUTION BY SEX AND RELIGIOUS SECT OF
SCORES ON THEMES


                         Number                           Diff.
                           of     Mean     SE      SE      in      SE     D/SE
Sex          Sect       Subjects  Score  Sample   Mean   Means   Diff.   Diff.
Boys (158)   Amish         70     2.79    1.19    .14
             Non-Amish     88     4.10    1.61    .17    1.31     .22    5.85
Girls (136)  Amish         64     2.60    0.97    .12
             Non-Amish     72     3.84    1.65    .20    1.24     .23    5.94
Both (294)   Amish        134     2.70    1.09    .09
             Non-Amish    160     3.98    1.64    .13    1.28     .16    7.97


It should be added that the distributions of scores for Amish 
children, both boys and girls, were distinctly skewed. Ninety- 
two (68.7 per cent) of the scores of all Amish children fell within 
the range 1.9 to 2.7, indicating an attitude definitely unfavorable 
toward war. The remainder of the scores were scattered. The 
distributions of scores for non-Amish children, both boys and
girls, were distinctly bimodal. Sixty-two (38.8 per cent) of the
scores of all non-Amish children fell within the range 1.9 to 2.7.
Sixty-one (38.1 per cent) of the scores of all non-Amish children
fell within the range 5.1 to 5.9, indicating an attitude mildly
favorable toward war. The remainder of the scores were scattered.


ATTITUDES AS MEASURED BY AN ATTITUDE SCALE


Comparative data for Amish and non-Amish children based 
on the Attitude Toward War scale are indicated in Table II. It 
is to be noted that non-Amish boys were more favorable toward 
war than non-Amish girls, but the difference is not statistically 
significant, the significance ratio being 1.49. Amish girls were 
slightly more favorable toward war than Amish boys but the 
significance ratio is only .50. Non-Amish boys were more 
favorable toward war than Amish boys and the difference is 
statistically significant. Although non-Amish girls were more 
favorable toward war than Amish girls the difference is not 
statistically significant. It is of interest to note that all the mean 
scores in Table II fall within the classification ‘Moderately 
opposed to war’ as that classification was determined for adults
in the inter-war period.


TABLE II.—DISTRIBUTION BY SEX AND RELIGIOUS SECT OF SCORES
ON THE ATTITUDE TOWARD WAR SCALE


                                                 Diff.
                      Mean     SE       SE        in         SE         D/SE
Sex     Sect          Score  Sample    Mean      Means      Diff.       Diff.
Boys    Amish         3.83    1.03     .12
        Non-Amish     4.80    1.19  [illegible] [illegible] [illegible] [illegible]
Girls   Amish         3.92    1.08     .14
        Non-Amish     4.04    1.08     .13       .12        .19         0.64
Both    Amish         3.87    1.06     .09
        Non-Amish     4.18    1.11     .09       .31        .13         2.44


In addition to the comparison of scores indicated above a 
detailed study was made of responses to the various statements 
on the attitude scale. It was found that for three statements 
differences in responses of total Amish and total non-Amish 
children were of statistical significance. The statement, “War
brings out the best qualities in men” was agreed to by 24.6 per
cent of the Amish children and by 44.4 per cent of the non-Amish
children. The difference of 19.8 per cent has a significance ratio
of 3.60. “Under some conditions, war is necessary to maintain
justice” was agreed to by 53.7 per cent of the Amish and by
71.4 per cent of the non-Amish children. The difference of 17.7
per cent has a significance ratio of 3.16. “Although war is
terrible it has some value” was agreed to by 51.5 per cent of the
Amish and by 68.3 per cent of the non-Amish children. The 
difference of 16.8 per cent has a significance ratio of 2.95. 
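
For differences between percentages, a comparable ratio can be sketched with the conventional unpooled standard error of a difference between two proportions. The formula and the group sizes (134 Amish and 160 non-Amish children, taken from the description of the subjects) are assumptions, since the text does not state how the ratios were computed or how many children answered each statement; the function name is hypothetical.

```python
from math import sqrt


def significance_ratio_of_percentages(pct_a, n_a, pct_b, n_b):
    """Difference between two percentages divided by its (unpooled) standard error."""
    p_a = pct_a / 100.0
    p_b = pct_b / 100.0
    se_diff = sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    return abs(p_a - p_b) / se_diff


# "Under some conditions, war is necessary to maintain justice"
print(round(significance_ratio_of_percentages(53.7, 134, 71.4, 160), 2))
# "Although war is terrible it has some value"
print(round(significance_ratio_of_percentages(51.5, 134, 68.3, 160), 2))
```

Under these assumptions the two results fall close to the reported ratios of 3.16 and 2.95.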

The statement on the scale showing the greatest difference 
between total boys and total girls was, “War brings out the best
qualities in men.” This statement was agreed to by 39.2 per
cent of the boys and by 30.9 per cent of the girls. The
difference of 8.3 per cent is not of statistical significance, the 
significance ratio being 1.48. 

The percentages of Amish and the percentages of non-Amish 
children agreeing with each of the twenty statements on the scale 
were computed. Only those statements agreed to by seventy- 
five per cent or more of either group and those agreed to by 
twenty-five per cent or less of either group will be indicated here. 

The statement, “War is a ghastly mess” was agreed to by
90.1 per cent of the Amish children and by 91.4 per cent of the
non-Amish children. The corresponding percentages for each
of the following statements were as follows: “International
disputes should be settled without war,” 88.9 per cent and 92.5
per cent; “It is good judgment to sacrifice certain rights in order
to prevent war,” 79.9 per cent and 85.0 per cent; “The evils of
war are greater than any possible benefits,” 76.1 per cent and
74.5 per cent; “War has some benefits; but it’s a big price to pay
for them,” 73.2 per cent and 82.0 per cent; “Pacifists have the
right attitude, but some pacifists go too far,” 63.3 per cent and
75.0 per cent.

The statement on the scale which was agreed to least was, 
“War is glorious.” One non-Amish boy and one Amish boy
agreed with this statement but no girls agreed with it. The
statement, “There can be no progress without war” was agreed
to by 9.7 per cent of the Amish and by 5.0 per cent of the
non-Amish children. The corresponding percentages for each of the
following statements were as follows: “War is the only way to
right tremendous wrongs,” 14.9 per cent and 13.7 per cent; “I
never think about war and it doesn’t interest me,” 28.4 per cent
and 18.1 per cent; “War brings out the best qualities in men,”
24.6 per cent and 44.4 per cent. Although the significance ratio
for the difference is only 2.07, it is of special interest to note that
a higher percentage of Amish than of non-Amish children agreed
with the statement, “I never think about war and it doesn’t
interest me.” Of course, agreement with this statement,
especially in time of war, indicates an escape mechanism. 


CORRELATION BETWEEN MEASUREMENTS MADE BY 
THE TWO TECHNIQUES 


Two techniques have been used to measure children’s attitudes 
toward war. Scores on the themes were correlated with scores 
on the attitude scale for Amish, non-Amish and total groups. 
For Amish children, having a distinctly skewed distribution in 
scores on themes, the coefficient of correlation was positive .045, 
standard error .086. For non-Amish children, having a distinctly 
bimodal distribution of scores on themes, the coefficient of 
correlation was positive .309, standard error .072. The coefficient 
of correlation for all themes and all attitude scales was positive 
.223, standard error .055. 
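
The standard errors quoted for these coefficients agree with the classical large-sample formula, one minus the squared coefficient divided by the square root of the number of cases. The sketch below assumes that formula and the group sizes of 134, 160, and 294 given earlier; the text itself does not name the formula, and the function name is hypothetical.

```python
from math import sqrt


def standard_error_of_r(r, n):
    """Classical large-sample standard error of a correlation coefficient."""
    return (1.0 - r ** 2) / sqrt(n)


for label, r, n in (("Amish", 0.045, 134),
                    ("Non-Amish", 0.309, 160),
                    ("Total", 0.223, 294)):
    print(label, round(standard_error_of_r(r, n), 3))
```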


SUMMARY


Attitudes toward war in time of war of Amish and non-Amish 
seventh- and eighth-grade children have been measured by two 
techniques. Themes written by Amish children on the subject, 
“How the war affects me” were rated by five judges as
significantly more unfavorable toward war than themes written on the
same subject by non-Amish children from the same general 
socio-economic level. For the Amish children the distribution 
of scores on the themes was distinctly skewed in the direction of 
indicating an unfavorable attitude toward war. For non-Amish 
children the distribution of scores on themes was bimodal, one 
mode indicating an attitude unfavorable toward war, the other 
indicating an attitude mildly favorable toward war. Scores on 
the themes suggested that girls were more unfavorable toward 
war than boys although the differences were not statistically 
significant. 

On an attitude-toward-war scale non-Amish children were 
found to be more favorable toward war than Amish children, the 
difference being statistically significant for boys. As measured 
by adult standards in time of peace, all mean scores fell within a 
range suggesting an attitude of moderate opposition to war. 
Scores on the attitude scale suggested that non-Amish girls are 
more unfavorable toward war than non-Amish boys although 
the difference is not statistically significant. There was but 
little correlation between scores obtained on the themes and
scores obtained on the attitude scale. 

Amish children indicated less acceptance than did non-Amish 
children of the statements that war brings out the best qualities 
in men, that under some conditions war is necessary to maintain 
justice, and that although war is terrible it has some value. 
There was considerable agreement among both Amish and non- 
Amish children that war is a ghastly mess, that international 
disputes should be settled without war, that certain rights should 
be sacrificed in order to prevent war, that the evils of war are 
greater than any possible benefits, and that although war has 
some benefits it is a big price to pay for them. Both Amish and 
non-Amish children agreed that war is not glorious, that there 
can be progress without war, and that there are ways to right
tremendous wrongs other than by war. 

In the present study the subjects were all children from rural
or small-town homes. A comparison with the attitudes of city 
children whose parents are engaged in highly remunerative war 
work would be of interest. The writer hopes to make a study 
comparable to the present one after peace has been achieved. 


RECITATION OR RECALL AS A FACTOR
IN THE LEARNING OF LONG PROSE SELECTIONS


H. A. PETERSON¹
Illinois State Normal University


THE PROBLEM


In 1917 A. I. Gates published a study of the effect on learning 
of increasing the proportion of the learning time which is devoted 
to recitation as compared with that used for reading. While he 
used both nonsense syllables and short prose passages, we shall 
consider only his work with the latter, as they are the more similar 
to study in school. His subjects were about three hundred 
children from the third to the eighth grades and some adults, but 
we shall consider the results from the eighth grade, the results 
from the other subjects being similar. The prose paragraphs 
were short 170-word biographical sketches from Men of Science 
and Who’s Who, such as the following: 


James Church, born in Michigan, February 15, 1869. Studied in 
Munich, and later studied forestry and agriculture. Director of 
Mt. Rose Weather Observatory in 1906. Studied evaporation of 
snow, water content, and frost. 


Nine minutes were allowed for the study of each. The dis- 
tinctive feature of the experiment was the variation in the propor- 
tions of the learning time allowed for reading and recitation. 
The equivalent group method was used, and there were six 
divisions of the learning time tried. If we let P stand for reading 
and R for recitation (a word here synonymous with recall), they
were: 1) P 100 per cent, R 0; 2) P 80 per cent, R 20 per cent; 3)
P 60 per cent, R 40 per cent; 4) P 40 per cent, R 60 per cent; 5)
P 20 per cent, R 80 per cent; 6) P 10 per cent, R 90 per cent. In
reciting the subjects held printed copies of the selection before 
them and looked over the top, attempting to recall the facts; 
when they could not proceed, they consulted the copy. Thus 
what is called ‘recitation’ was in reality a combination of recita- 
tion and reading. 





¹ The author wishes to express his indebtedness to Dr. Stanley Marzolf
of the Illinois State Normal University for assistance in the statistical treat- 
ment of the results, and to Dr. C. F. Malmberg and Miss Ethel Burris of 
the same institution for aid in conducting the experiment. 

When the amount recalled in reading alone is taken as 100, the
amounts recalled in the other situations increased, as the per- 
centage of time devoted to recitation increased, up to about 
120 per cent when R was 60 per cent, but did not increase after 
that. This was in immediate recall. In the tests four hours later
the amounts recalled through the use of a high percentage of 
time devoted to recitation were similar but larger, standing 
eventually at 162 per cent when three-fifths of the time was 
devoted to recitation. The gains from the use of recitation were 
greater for nonsense syllables than for prose selections. The 
results obtained by Gates were confirmed by Forlano in 1936, 
and have been accepted as a basic principle of effective learning 
both in general and in educational psychology. This principle 
might be stated thus: After a certain amount of reading, it is 
more profitable to recite than to continue reading. 

What strikes the student of psychology about this experiment 
is its inapplicability to ordinary study or learning situations. It 
pays no attention to the proportion of the learning time needed 
to get an understanding of the passage, or even to read it once. 
The passages Gates used were very short, and could be read 
through many times in nine minutes. They could even be read 
through many times and still be recited a number of times. But 
suppose the passage is from five to fifteen pages long and the 
learning time is only enough to read it through once or twice, 
which is an ordinary high school or college study situation. If 
one-half of the study time were devoted to reciting, the student 
might not have time enough to read the entire passage through 
once, and he would have no use for that much time for reciting. 
For it is assumed that he seeks only the important ideas, and not 
every little detail. In the Gates investigation one score point was 
allowed for each phrase, such as the person’s name or where he 
was born. As a matter of fact the selections used by Gates, in 
the way they were used, were very far from being typical school 
learning situations. They were to a high degree rote learning. 
In the usual school study situation it would be impracticable and 
unwise to devote 60 per cent or even 40 per cent of the learning 
time to reciting. Twenty per cent would probably be nearer to
what is needed. 

The criticism above may be summarized by saying that Gates’ 
conclusions are undoubtedly correct for brief, rather disconnected
passages, but in long meaningful passages one must take into
account the proportion of learning time that is necessary to read 
the passage and understand it. Up to that point reading is more 
important than reciting. Gates’ results obtained from semi-rote 
matter have been unwarrantedly extended by writers in educa- 
tional psychology to sense matter in which numerous readings 
and recitations are impracticable. 


DESCRIPTION OF THE EXPERIMENT 


Our purpose was to set up a typical college study situation in 
which the proportion of reading time to recall time was varied in 
a manner that was appropriate to long prose passages, and test 
the results by means of objective tests, both in immediate recall 
and after two weeks. The subjects were about one hundred five 
college sophomores. We emphasize the fact that long prose 
passages were used, for this is the distinctive feature of the experi- 
ment. Brief excerpts from the three passages used are given 
together with their approximate length: 

Excerpt from passage entitled “Hannibal.” Summarized
from W. S. Davis, Readings in Ancient History. Length, about
2600 words or 7.4 pages:


Hanno, leader of the party opposing (the choice of Hannibal as
commander) then said, “Hasdrubal seems to ask what is reasonable,
still I think his request ought to be refused.” He then went on to
argue that rearing Hannibal in the expectation of a great command,
like that in Spain, would fit him only to play the tyrant over the
Carthaginians. He concluded by saying, “To my mind this young
fellow should be kept at home, under the restraint of the laws and
the power of the magistrates, and taught to live on an equal footing
with the rest of the citizens.”


Excerpt from passage entitled “The Nature and Possibilities
of Tropical Agriculture,” based on an article by E. Huntington
and S. W. Cushing, Jour. of Geog., 1919, 18: 341-48. Length,
about 2200 words or 6.3 pages:


A second great handicap in equatorial rain forests is the difficulty 
of keeping domestic animals even in the clearings. Noxious insects 
plague them almost as badly as they plague men. For example, in 
large parts of tropical Africa the bite of the tsetse fly not only causes 
the deadly sleeping sickness in man, but is fatal to domestic animals, 
except perhaps the donkey. Even if animals escape disease, they 
rarely thrive, for grain cannot be raised and what little grass can 
grow among the luxuriant trees is usually so rank and coarse that it
is not nutritious. 


Excerpt from passage entitled “The Policies of Labor Unions,”
based on Selected Readings in Economics by C. J. Bullock, and
Labor Problems, 1940 ed., ch. 19, by G. S. Watkins and P. A.
Dodd:


If the whole body of workers of a given kind can be brought into 
the union, so that the union can meet the employers as the repre- 
sentative of the whole, the position of the worker will be greatly 
strengthened. The fear that if he refuses to accept certain terms, 
another man will be employed in his place is removed. His igno- 
rance of market conditions will be partly remedied, both through the 
combinations of the knowledge of all the workers in the union, and in 
some cases by the broader outlook which the union officials, partly 
or wholly exempted from daily application to manual work, may be 
able to obtain. The whole matter of bargaining can be put into the 
hands of the most skillful; and the officers and leaders may develop 
a skill in bargaining, by constant practice, comparable to that of 
their opponents. 


The reading times of these selections were determined by hav- 
ing them read under normal conditions by five sophomores not in 
the experiment, and taking the average reading time for each 
selection. The total learning times in all cases were 150 per cent 
of the reading times, this being a typical college study situation. 
Three proportions of reading to recalling were selected upon which 
to experiment: in Method I all the time was devoted to reading 
and none to recalling; in Method II two-thirds of the time was 
devoted to reading and one-third to recalling; and in Method III 
one-half of the time was devoted to reading and one-half to
recalling. The clock time for the three selections worked out as
follows:


READING TIMES AND TOTAL LEARNING TIMES OF THE SELECTIONS

Selection                  Reading Time    Total Learning Time
“Hannibal”                 8’ 48”          13’ 20”
“Tropical Agriculture”     8’ 4”           12’ 6”
“Labor Unions”             [illegible]     10’ 36”


It is important to notice that when the proportions of reading 
to recalling given above are stated in terms of the reading times 
of the selections, they work out as follows: in Method I all of the 
learning time, which is one hundred fifty per cent of the reading
time, is given to reading and none to recalling; in Method II two-
thirds of the learning time or one hundred per cent of the reading
time is given to reading and one-third of the learning time (equal 
to fifty per cent of the reading time) is given to recalling; in 
Method III half of the learning time (or seventy-five per cent 
of the reading time) is given to reading and half to recalling. 
In the last method the subject will have to read a little faster 
than the average student in order to get through the passage once, 
but will have more time than usual for recalling. Or, he may 
not finish reading the passage once. 
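
The division of time described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the original procedure: the function names `split_learning_time` and `fmt` are hypothetical, and the one reading time used (8’ 4”, or 484 seconds, for “Tropical Agriculture”) is taken from the table of reading times.

```python
from fractions import Fraction

# Share of the learning time given to reading under each method (from the text).
READING_SHARE = {"I": Fraction(1), "II": Fraction(2, 3), "III": Fraction(1, 2)}
# The learning time was fixed at 150 per cent of the reading time.
LEARNING_FACTOR = Fraction(3, 2)


def split_learning_time(reading_seconds):
    """Return {method: (seconds for reading, seconds for recalling)}."""
    learning = reading_seconds * LEARNING_FACTOR
    return {
        method: (learning * share, learning * (1 - share))
        for method, share in READING_SHARE.items()
    }


def fmt(seconds):
    """Format a number of seconds as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return str(minutes) + "' " + str(secs) + '"'


# "Tropical Agriculture": reading time 8' 4" = 484 seconds.
plan = split_learning_time(484)
for method, (reading, recalling) in plan.items():
    print("Method", method, "reading", fmt(reading), "recalling", fmt(recalling))
```

Under Method III the reading allowance is 363 seconds against a measured reading time of 484 seconds, which is the three-fourths figure stated in the text.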

Attached to the subject’s copy of each learning passage were 
the directions for learning it, including a statement of how the 
time was to be divided. The subjects were not told the reading time.
After expiration of the learning time, the passages were collected. 

Retention was measured by means of three objective tests of 
twenty-four questions each, one for each passage. One score 
point was allowed for each question correctly answered. The 
same test was used in delayed as in immediate recall. Most of 
the questions were true-false and multiple-choice, but a few were 
completion in which credit was given for the correct idea. Exam- 
ples are the following: 

1) Many of the greatest senators believed that Hannibal 
should not be trained for military service because: (a) They 
feared he would be a tyrant over the Carthaginians; (b) Hannibal 
was not physically strong; (c) They wanted him to be a senator. 

2) The grass that grows among the luxuriant trees of the rain 
forest is usually so rank and coarse that it is of little nutritional 
value. (T-F) 

3) Unions are said not to favor grading workers at the same 
job according to their skill or speed, and paying them accord- 
ingly, because: (a) it would promote discord in the union; (b) it is 
impossible; (c) it would lower the average wage; (d) the examin- 
ing would tend to become a state function and hence lessen the 
power of the unions. 

One hundred five subjects were divided into three groups of 
thirty-five each equated on the basis of their scores on the 
Teachers College Aptitude Test given the year before. The 
experiment was arranged as a Latin square.² All three selections
were learned by each method. Each group used each of the three
methods, with a different selection for each method.

² Fisher, R. A. The Design of Experiments. (Second Edition) London:
Oliver and Boyd, 1937.


RESULTS AND CONCLUSIONS 


It is well to remind the reader at this point that the learning 
time in all cases was fixed at a fifty per cent increase of the reading 
time, as determined by averages of five sophomores selected by 
chance and not in the experiment. 


TABLE 1.—COMPARISON OF THE THREE METHODS: IMMEDIATE RECALL

Group   Method I                   Method II                   Method III
        (all reading, no recall)   (2/3 reading, 1/3 recall)   (1/2 reading, 1/2 recall)
A       Trop. Agric.   16.88*      Labor Unions   16.13        Hannibal       16.31
B       Labor Unions   14.76       Hannibal       17.54        Trop. Agric.   13.20
C       Hannibal       16.45       Trop. Agric.   15.31        Labor Unions   12.30
Means                  16.03                      16.33                       13.94
* Mean scores of the groups.


The results are shown in Table 1. The entries in any one row 
represent the mean scores for the methods and selections for each 
of the three groups. It is clear from the means of the methods 
that the first two methods are superior to the third, but there is 
not much difference between the first two. 
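
As a check on the arrangement, the layout of Table 1 can be tested for the Latin-square property (each selection once in every row and once in every column) and the method means recomputed. The sketch below is illustrative only; the names `TABLE_1`, `is_latin_square`, and `method_means` are hypothetical, and the figures are copied from Table 1.

```python
# Table 1: group -> [(selection, mean score) under Methods I, II, III].
TABLE_1 = {
    "A": [("Trop. Agric.", 16.88), ("Labor Unions", 16.13), ("Hannibal", 16.31)],
    "B": [("Labor Unions", 14.76), ("Hannibal", 17.54), ("Trop. Agric.", 13.20)],
    "C": [("Hannibal", 16.45), ("Trop. Agric.", 15.31), ("Labor Unions", 12.30)],
}


def is_latin_square(table):
    """True if every selection occurs exactly once per row and per column."""
    rows = [[selection for selection, _ in row] for row in table.values()]
    symbols = set(rows[0])
    square = len(rows) == len(symbols) and all(len(r) == len(symbols) for r in rows)
    rows_ok = all(set(r) == symbols for r in rows)
    cols_ok = all(set(c) == symbols for c in zip(*rows))
    return square and rows_ok and cols_ok


def method_means(table):
    """Mean of the group scores under each method (the columns of the table)."""
    scores = [[score for _, score in row] for row in table.values()]
    return [sum(column) / len(column) for column in zip(*scores)]


print(is_latin_square(TABLE_1))
print([round(m, 2) for m in method_means(TABLE_1)])
```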


TABLE 2

Source of variation   D. f.*   Sum of squares   Variance   F
Methods               2        10.19            5.09       69.9†
Groups                2         4.93            2.46       33.8
Selections            2         8.84            4.42       60.6
Residual              2          .15             .07
Total                 8        24.11

* Degrees of freedom. † Significant at the five-per-cent level.
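
The sums of squares of Table 2 appear to follow from the nine cell means of Table 1 by the usual Latin-square analysis of variance. The sketch below is an assumption about how the table was computed, not a statement of the authors' procedure; the names `CELLS`, `factor_ss`, and `latin_square_anova` are hypothetical. Because the residual is very small, F ratios computed from unrounded values differ somewhat from the printed ones.

```python
# Cell means from Table 1 as (group, method, selection, mean score).
CELLS = [
    ("A", "I", "Trop. Agric.", 16.88),
    ("A", "II", "Labor Unions", 16.13),
    ("A", "III", "Hannibal", 16.31),
    ("B", "I", "Labor Unions", 14.76),
    ("B", "II", "Hannibal", 17.54),
    ("B", "III", "Trop. Agric.", 13.20),
    ("C", "I", "Hannibal", 16.45),
    ("C", "II", "Trop. Agric.", 15.31),
    ("C", "III", "Labor Unions", 12.30),
]
FACTORS = {"Groups": 0, "Methods": 1, "Selections": 2}


def factor_ss(cells, index, correction):
    """Sum of squares for the factor stored at position `index` of each cell."""
    totals = {}
    for cell in cells:
        totals[cell[index]] = totals.get(cell[index], 0.0) + cell[3]
    per_level = len(cells) / len(totals)  # observations per level of the factor
    return sum(t * t for t in totals.values()) / per_level - correction


def latin_square_anova(cells):
    """Partition the total sum of squares into the three factors and a residual."""
    scores = [cell[3] for cell in cells]
    correction = sum(scores) ** 2 / len(scores)
    ss = {name: factor_ss(cells, i, correction) for name, i in FACTORS.items()}
    ss["Total"] = sum(s * s for s in scores) - correction
    ss["Residual"] = ss["Total"] - sum(ss[name] for name in FACTORS)
    return ss


ss = latin_square_anova(CELLS)
DF = 2  # each factor and the residual have 2 degrees of freedom in a 3 x 3 square
residual_variance = ss["Residual"] / DF
for source in ("Methods", "Groups", "Selections"):
    variance = ss[source] / DF
    f_ratio = variance / residual_variance
    print(f"{source:<11}{ss[source]:7.2f}{variance:7.2f}{f_ratio:8.1f}")
print(f"{'Residual':<11}{ss['Residual']:7.2f}{residual_variance:7.2f}")
print(f"{'Total':<11}{ss['Total']:7.2f}")
```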


Data for determining the significance of these results are shown
in Table 2.² The most effective distribution of the learning time
in immediate recall is to give two-thirds to reading and one-third
to recalling; to give all the time to reading and none to recalling
is only slightly inferior; and the least efficient division is the one
which gives the least time to reading and the most to recalling.
The third result does not mean that recall is ineffective, but that
not enough time was allowed for reading and assimilation. For
to allow only one half of the learning time to reading is to allow
only three-fourths of the official reading time for reading. One
group in this division of the learning time was asked to mark the
place in the passage where they stopped reading. Thirty-five
per cent of the group failed to finish reading the passage. This is
undoubtedly the reason why this time division was the least
efficient. The excellent results of devoting all the time to read-
ing when the learning time is only one hundred fifty per cent of
the reading time are noteworthy. As between this method and
Method II, it still remains to be ascertained whether in these
circumstances a smaller percentage than one-third for recall is
better than one-third.

² Op. cit.


TABLE 3.—COMPARISON OF THE THREE METHODS: DELAYED RECALL

Group   Method I                   Method II                   Method III
        (all reading, no recall)   (2/3 reading, 1/3 recall)   (1/2 reading, 1/2 recall)
A       Trop. Agric.   16.45*      Labor Unions   16.51        Hannibal       14.97
B       Labor Unions   16.18       Hannibal       15.97        Trop. Agric.   14.65
C       Hannibal       15.96       Trop. Agric.   15.47        Labor Unions   13.76
Means                  16.20                      15.98                       14.46
* Mean scores of the groups.


The results for delayed recall are shown in Table 3. It is to 
be read in the same manner as Table 1. The significance of 
these results is shown in Table 4. 


TABLE 4

Source of variation   D. f.   Sum of squares   Variance   F
Methods               2       5.38             2.69       24.4**
Groups                2       1.20              .60        5.45
Selections            2        .03              .015       7.33
Residual              2        .22              .11
Total                 8       6.83

** Not significant at the five-per-cent level.


In delayed recall (Table 3) the most effective distribution of 
the learning time is to give all to reading and none to recall, but 
the plan of giving two-thirds to reading and one-third to recall 
is only slightly inferior. Giving one-half of the time to reading 
and one-half to recall is, again, the least efficient, but, as shown 
by Table 4, the difference between it and the other two methods 
is here not significant at the five-per-cent level. It is, therefore,
certain that the exchange of position in Methods I and II from
immediate to delayed recall is not significant. 

Undoubtedly there is no such thing as pure reading with no 
recall. We made the directions as iron-clad as we could. Cer- 
tainly there was a great difference in where the emphasis was 
placed as between an all-reading-no-recall situation and the 
other two. 

Why do our results differ markedly from those of Gates? The 
situations are different. In the one case the material is a short 
series of items among which there is not much connection in 
meaning. We have a prompting situation introduced into recall. 
The learner tries to recall as many of the items as he can. When 
he cannot proceed, he prompts himself by a glance at the text. 
In this way he works only on what is missing. In the other 
situation the material is an enormously longer highly meaningful 
passage from the study of which the learner has tried to extract 
the principal meanings. In recall he does not have a copy of the 
text before him, nor does he have the questions of the test, so 
he tries to recall wholly from memory what he thinks are the 
principal ideas in the passage he has just studied. It is not a 
prompting situation. It could not be said that he works only 
on what is missing. Thus, while it is a wholly natural study 
situation, it is not the same situation as in the Gates study. 
Our whole claim is that conclusions from the Gates study have 
been extended to situations to which in the large they do not
apply.


COMPARISON OF THE RESULTS WITH CURRENT TEXTS


The importance of this investigation becomes obvious when 
we consider that in nearly all current texts in educational psy- 
chology the results of the Gates study are unwarrantedly extended 
either explicitly or by implication to the usual study situation 
in school subjects. We quote from a few: 

“If a passage of reading matter is studied by first reading it
through and then reciting to oneself . . . , the retention will be 
appreciably better than if one merely reads and rereads without 
any effort at recitation. In fact the larger the percentage of 
time spent in reciting, the better the retention, as shown by the 
table below” (Gates’ results then quoted).⁴ ⁵





4 McGeoch, J. A. The Psychology of Human Learning, 1942, p. 199.
5 Pressey, S. L. Psychology and the New Education, 1933, p. 408.

Another writer after quoting Gates’ results says: “Note that
by using the recitation method one can increase his efficiency in 
learning sense-material over twenty per cent. The relative 
amounts of reading and recitation for all types of material were 
not determined, but this experiment would imply that more 
time may profitably be given to the recitation than to the reading. 
In sense material as much as three-fifths of the time may be 
profitably allotted to the recitation.”⁶

Commins remarks that in general it has been found that it 
(reciting) is most advantageous when introduced not too early.⁷

“One-half to even over three-fourths of the time apportioned
to a lesson can be used with profit in recalling or reciting the
content of a lesson.”⁸

Valentine quotes from Gates’ study to show that he (Gates)
had in mind “working under school conditions and with ordi-
nary schoolroom methods of attack.” His own “applications”
show that he also had the same situation in mind.⁹ In fact, the
Gates, Jersild, McConnell, and Challman text says, after quoting
the results of Gates’ 1917 study: “A study of the table will
disclose several facts: (1) the greater the amount of time devoted
to recitation, the greater the percentage of the lesson recalled.
Of course some time must be spent at the start in reading the
material.”¹⁰

Our own experiment does not deny the improvement in learning
when recall is introduced, but makes the points: (1) that in
determining the proportion of the entire learning time to be 
devoted to learning and to recall, one must take into considera- 
tion the amount of time necessary to read and to understand the 
passage; (2) in the ordinary study situation a smaller proportion 
devoted to recall than that indicated by the Gates experiment 
will usually be found advisable; and (3) where the learning time 
is not greatly in excess of the time required to read the passage 
through once, devoting all of it to reading is substantially as 
effective as giving one-third to recall. 





6 Jordan, A. M. Educational Psychology, rev. ed., 1933, p. 180. 

7 Commins, W. D. Principles of Educational Psychology, 1937, p. 412. 

8 Sorenson, H. Psychology in Education, 1940, p. 289. 

9 Valentine, W. L. Experimental Foundations of General Psychology, 1941,
pp. 342, 347. 

10 Gates, A. I., Jersild, A. T., McConnell, T. R., and Challman, R. C. 
Educational Psychology, 1942, p. 389. 





RELATIONSHIP BETWEEN KUHLMANN- 
ANDERSON INTELLIGENCE TESTS AND 
ACADEMIC ACHIEVEMENT IN GRADE IV* 


MILDRED M. ALLEN 
School Psychologist, New Rochelle, N. Y., Public Schools 


The purpose of this study was to determine the relationship 
between the Kuhlmann-Anderson Intelligence Test, fourth- 
grade battery, and educational achievement as measured by the 
New Stanford Achievement Test, Form W. Relationships were 
determined between total scores, and also between the com- 
ponent parts of each test. Both tests were administered at 
approximately the same time in Grade IV near the beginning of 
the school term in October, 1939. 

The subjects used in this study were three hundred and 
twenty-seven pupils from ten elementary schools in New Rochelle, 
New York. 

Correlations between the Kuhlmann-Anderson scores and the 
various subtests of the New Stanford Achievement Test are 
shown in Table I. 

* Part of a Doctor’s dissertation completed at New York University,
Graduate School of Education, 1940.


TABLE I.—COEFFICIENTS OF CORRELATION BETWEEN INDICES OF
INTELLIGENCE DERIVED FROM THE KUHLMANN-ANDERSON
INTELLIGENCE TEST, AND SUBTESTS OF THE NEW
STANFORD ACHIEVEMENT TEST IN GRADE 4

Stanford Achievement Test—        Kuhlmann-Anderson Test—Grade 4
Grade 4                           MA      IQ      Pc. Av.
Paragraph Meaning...........      .72     .68     .65
Word Meaning................      .64     .61     .57
Reading Average.............      .71     .68     .65
Arithmetic Reasoning........      .70     .65     .60
Arithmetic Computation......      .57     .53     .51
Arithmetic Average..........      .70     .66     .61
Spelling....................      .63     .62     .57
Total Average...............      .77     .74     .70
Educational Age.............      .78     .75     .71
Educational Quotient........      .74     .84     .80


The coefficients in the above table, exclusive of those involving
EQ, range from .51 to .78 with a median of .65. Ten of the
twenty-seven coefficients are .70 or higher, twelve range between
.60 and .69, and five fall below .60. All of the coefficients
involving MA (exclusive of that with EQ) are higher than the
corresponding coefficients in which IQ (or Pc. Av., Per cent of
Average Development) is one variable. The σr values are of
such magnitude that the differences between coefficients involv-
ing MA, and those involving Pc. Av., approach statistical
significance. The critical ratio for the differences of the coeffi-
cients, .70 (MA and arithmetic reasoning) and .60 (Pc. Av. and
arithmetic reasoning) is 2.5. If the coefficients involving MA
were not so consistently higher than those involving IQ and
Pc. Av. the differences might be ignored. Since all are higher,
and since some are significantly higher, it may be assumed that
the MA score obtained from the Kuhlmann-Anderson Intelli-
gence Test possesses somewhat greater value than the IQ or
Pc. Av. as an indicator of educational achievement. The corre-
lations involving IQ are in each case higher than the correspond-
ing coefficients for Pc. Av. scores. These differences are so
slight that they may be ignored.
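
The counts and the median quoted for Table I can be checked directly from its twenty-seven coefficients. The following is a small illustrative sketch; the names `TABLE_I` and `summarize` are hypothetical, and the label of the seventh row, which is garbled in the source, is read here as Spelling on the assumption that it is the remaining Stanford subtest named in the text.

```python
from statistics import median

# Table I coefficients as (MA, IQ, Pc. Av.), Educational Quotient row excluded.
# "Spelling" is an assumed reading of a garbled row label.
TABLE_I = {
    "Paragraph Meaning": (0.72, 0.68, 0.65),
    "Word Meaning": (0.64, 0.61, 0.57),
    "Reading Average": (0.71, 0.68, 0.65),
    "Arithmetic Reasoning": (0.70, 0.65, 0.60),
    "Arithmetic Computation": (0.57, 0.53, 0.51),
    "Arithmetic Average": (0.70, 0.66, 0.61),
    "Spelling": (0.63, 0.62, 0.57),
    "Total Average": (0.77, 0.74, 0.70),
    "Educational Age": (0.78, 0.75, 0.71),
}


def summarize(table):
    """Median, range counts, and ordering checks for the coefficients."""
    values = [r for row in table.values() for r in row]
    return {
        "n": len(values),
        "median": median(values),
        "at_least_70": sum(r >= 0.70 for r in values),
        "from_60_to_69": sum(0.60 <= r < 0.70 for r in values),
        "below_60": sum(r < 0.60 for r in values),
        "ma_highest": all(ma > iq and ma > pc for ma, iq, pc in table.values()),
        "iq_above_pc_av": all(iq > pc for _, iq, pc in table.values()),
    }


summary = summarize(TABLE_I)
print(summary)
```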

The median coefficient of .65 of Table I (excluding that involv- 
ing EQ) has a corresponding k (coefficient of alienation) of 
.7599 which indicates a reduction in errors of prediction (were 
these data to be used for prediction at such short intervals) of 
twenty-four per cent. The k for the highest coefficient of this 
table (the .78 between MA and EA) is .6258, which gives an 
index of prediction efficiency of 37.42 per cent. An appreciable 
reduction of errors in predicting EA would occur by using MA 
as the predictive measure. 
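
The k and prediction-efficiency figures above agree with the customary definitions k = √(1 − r²) and E = 100(1 − k). The sketch below assumes those definitions, which the text does not spell out; the function names `alienation` and `predictive_efficiency` are hypothetical.

```python
import math


def alienation(r):
    """Coefficient of alienation k for a correlation r, assuming k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def predictive_efficiency(r):
    """Per cent reduction in errors of prediction, assuming E = 100 * (1 - k)."""
    return 100.0 * (1.0 - alienation(r))


# The median coefficient of Table I and the highest coefficient (MA with EA).
for r in (0.65, 0.78):
    print(f"r = {r:.2f}  k = {alienation(r):.4f}  E = {predictive_efficiency(r):.2f}")
```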

A study of the correlations between EQ and MA, IQ, and
Pc. Av. reveals relatively high relationships of .74, .84, and .80,
respectively. Correlations of this size indicate greater reliability
of prediction when tests are given at approximately the same
time. It is evident that the EQ is the most predictable score
obtained from the Stanford Achievement Tests, and both IQ and
Pc. Av. are better predictors of it than the MA. These high
coefficients suggest the possibility of a spurious correlation, since
both sets of measurements (i.e., the EQ on the one hand and the
IQ and Pc. Av. on the other) have the same divisor; namely, the
CA. For example, the correlation of .74 between MA and EQ is
not statistically higher than coefficients for MA and paragraph
meaning, reading average, arithmetic reasoning, arithmetic aver-
age, total average score, or EA; whereas the correlations for IQ,
or Pc. Av., and EQ are appreciably higher than any other IQ or
Pc. Av. coefficients.

An additional analysis of relationships between Kuhlmann- 
Anderson Intelligence Tests in Grade 4, and the New Stanford 
Achievement Test, Advanced Examination, was made. The
problem was as follows: Do any of the subtests of the Kuhlmann- 
Anderson Intelligence Test have significant relationships with 
any of the subtests or derived scores on the New Stanford 
Achievement Test, Advanced Examination, in Grade 4? The 
coefficients of correlation among these variables are shown in 
Table II. 

TABLE II.—COEFFICIENTS OF CORRELATION BETWEEN SUBTESTS
OF THE KUHLMANN-ANDERSON INTELLIGENCE TESTS AND
SUBTESTS OF THE NEW STANFORD ACHIEVEMENT TEST
IN GRADE 4

Stanford Achievement                   Kuhlmann-Anderson Subtests
Subtests                     15    16    17    18    19    20    21    22    23    24
Paragraph Meaning.......    .35   .39   .57   .46   .31   .05   .69   .71   .69   .34
Word Meaning............    .32   .31   .56   .39   .24   .03   .65   .69   .67   .23
Reading Average.........    .33   .37   .60   .45   .29   .04   .70   .73   .70   .29
Arithmetic Reasoning....    .38   .43   .45   .41   .45   .15   .58   .61   .58   .39
Arithmetic Computation..    .30   .31   .35   .43   .35   .17   .45   .47   .46   .33
Arithmetic Average......    .38   .43   .41   .43   .46   .15   .58   .60   .58   .38
Spelling................    .27   .33   .48   .51   .24   .11   .57   .63   .63   .26
Total Average...........    .38   .42   .57   .53   .38   .10   .71   .75   .71   .35
Educational Age.........    .39   .42   .58   .51   .38   .11   .71   .76   .73   .36
Educational Quotient....    .37   .42   .56   .48   .37   .12   .67   .73   .69   .31


Table II reveals three distinct groups into which the subtests
may be grouped; also, one borderline or doubtful test; namely,
Test 20. In the battery of Kuhlmann-Anderson Intelligence
Tests for Grade 4, five tests (15, 16, 19, 20, and 24) may be 
considered non-verbal, in which test performance does not 
depend upon reading ability. Tests 17, 18, 21, 22, and 23 are 
verbal tests and test performance depends upon the pupil’s 
reading ability. All coefficients for Test 20 are less than .18 and
well within 3σr of .00, which suggests that Test 20 may have no
true r’s above .00.

A second group of tests made up of 15, 16, 18, 19, and 24 have 
approximately the same coefficients with all subtests of the 
Stanford Achievement Test. The apparent drop for word 
meaning, arithmetic computation, and spelling is more apparent 
than real since these differences are not statistically significant. 

The third group of tests (21, 22, 23) are significantly higher in 
their correlations with the three reading tests, (critical ratio 
ranging from 5 to 6), arithmetic reasoning, spelling, (except for 
tests 17 and 18), and the three derived scores; namely, total 
average score, educational age (EA), and educational quotient 
(EQ) with a critical ratio ranging from 3.3 to 7. The close 
relationship of these tests (21, 22, 23) with the basic subject- 
matter of the Stanford Achievement Test is doubtless due to the 
verbal nature of the content material. These tests (21, 22, and 
23) are not significantly higher than the tests of the second group 
with respect to the correlations with arithmetic computation, 
(critical ratio ranging from around 2.0 to 2.5). Test 17, the 
borderline test, falls between group two (tests 15, 16, 18, 19, and 24)
and group three (tests 21, 22, and 23), but is not significantly 
different from either. 

From this analysis, it appears that three tests, (21, 22, 23) and 
possibly a fourth, (17), have fairly high correlations with scores 
on the three reading measures (.65 to .70), and the three derived 
scores, Total Average, EA, and EQ (.67 to .76). Since these 
tests are equally as reliable as the whole battery for predicting 
scholastic achievement, it would be economical in time (adminis- 
tering and scoring) and cost, to use them in place of the whole 
battery. Arithmetic computation is not related highly with any 
of the ten subtests (.17 to .47) and practically all other coeffi-
cients are in the range of .30 to .45, showing what is customarily
termed ‘slight to moderate positive relationships’ between 
Kuhlmann-Anderson subtests and scores on the New Stanford 
Achievement Test, Advanced Examination.

In considering the relationships between performance on the
Kuhlmann-Anderson Intelligence Test and the New Stanford 
Achievement Tests an additional question appeared; namely, 
were the scores derived from the Kuhlmann-Anderson Tests 
(MA, IQ, and Pe. Av.) more closely related to Stanford Achieve- 
ment scores than the scores from individual subtests of the 
Kuhlmann-Anderson battery? Also, was there a possible com-
bination of Kuhlmann-Anderson subtests which would predict some
achievement score better than it was predicted from the total
Kuhlmann-Anderson battery for Grade IV? 


TABLE III.—CORRELATIONS AND INTERCORRELATIONS BETWEEN
TOTAL AVERAGE SCORE ON THE NEW STANFORD ACHIEVE-
MENT TEST AND SUBTESTS OF THE KUHLMANN-ANDER-
SON INTELLIGENCE TEST IN GRADE 4

                                       Kuhlmann-Anderson Subtests
                             15    16    17    18    19    20    21    22    23    24
Stanford Achievement
Total Average Score         .38   .42   .57   .53   .38   .10   .71   .75   .71   .35

Kuhlmann-Anderson
Subtests
  16                        .26
  17                        .18   .28
  18                        .21   .31   .26
  19                        .23   .33   .19   .24
  20                        .13   .19   .67   .20   .14
  21                        .31   .35   .51   .42   .32   .10
  22                        .39   .42   .52   .38   .30   .12   .65
  23                        .33   .39   .52   .42   .30   .12   .61   .65
  24                        .24   .30   .14   .25   .41   .11   .32   .29   .32


Since one is usually interested in total educational achievement
rather than achievement in one subject only, one of the composite 
scores from the Stanford Achievement Test was used as the 
criterion; namely, the total average score. 

Table III shows the correlations and intercorrelations between 
the total average score on the New Stanford Achievement Test 
(the criterion), and the subtests of the Kuhlmann-Anderson 
Intelligence Tests (the independent variables).

In Table III the first horizontal row shows the correlation 
coefficients between total average score on the New Stanford 
Achievement Test for Grade 4, and scores on each of the subtests 
(15 through 24 inclusive) of the Kuhlmann-Anderson Intelligence 
Tests for Grade 4. The remainder of the table contains the 
intercorrelations among the subtests. The first row is read as
follows: The correlations between total average score and sub-
tests 15 through 24 inclusive are represented by r’s of .38, .42, .57,
.53, .38, .10, .71, .75, .71, and .35, respectively. In the second
row the correlation between subtest 16 and subtest 15 of the
Kuhlmann-Anderson battery is represented by an r of .26. The
third row is read as follows: The correlations between subtest 17
and subtests 15 and 16 are represented by r’s of .18 and .28,
respectively.

Table III reveals that subtests 21, 22, and 23 on the Kuhlmann- 
Anderson Intelligence Tests are highly correlated with total 
average score on the Stanford Achievement Test (.71, .75, and 
.71, respectively). The relationship between any one of these
tests (21, 22, or 23) and total average score is just as high as that
obtained between Stanford Achievement total average score and
the indices of intelligence (MA, IQ, or Pc. Av.) derived from the
total of the ten subtests when the battery is used as a whole. Coefficients
between the remaining subtests and total average score range 
from .10 for Test 20 to .57 for Test 17. The median coefficient 
of all ten subtests on the Kuhlmann-Anderson Intelligence Test 
with total average score on the Stanford Achievement Test is 
.475. A median obtained from coefficients covering such a wide 
range is probably of little significance. 

The problem then resolved itself into seeking the optimally 
weighted combination of scores from the subtests of the Kuhl- 
mann-Anderson battery which yielded the maximum multiple 
correlation coefficient with total average score on the Stanford
Achievement Test as the criterion. To accomplish this end the
Multiple Ratio Correlation Method devised by Toops and
reported by Garfield¹ was used.

The multiple ratio analysis shown in Table IV indicates how 
many of the ten subtests may profitably be kept in the battery, 
how each should be weighted to produce the maximal multiple 
correlation coefficient, and the magnitude of the multiple corre- 
lation obtained. In this analysis the single test having the 
largest correlation with the criterion, the ‘backbone test,’ was 
used as the starting datum and arbitrarily weighted at 1.00. To 
this correlation each of the other tests was tentatively added 
after approximately weighting it with respect to the ‘backbone 
test.’ The new correlation was computed and the test yielding 
the highest r was then added to test 1. The composite of the two 
tests thus obtained then became the ‘backbone’ and the process 
was repeated as many times as there were tests. The results of 
this analysis applied to data in Table III are as follows: Test 22 
has the highest correlation (.75) with total average score and is 
the ‘backbone’ weighted as 1.00. The highest r obtained by 
adding each of the other tests is .81 when Test 24 weighted at 
.2043 is added. Addition of any test other than Test 24 even 
when optimally weighted does not produce an r as high as .81. 
Successive additions produce the results shown in Table IV. 

In Table IV the maximal multiple correlation obtained from 
the ten subtests with total average score on the New Stanford 
Achievement Test is .884. The total increase in r obtained by 
adding the remaining nine tests to test 22 is .134 and five tests 
(22, 24, 18, 23, and 21) alone account for the increase from .75 to 
.875. An indication of the predictive value of each successive 
composite of tests is seen in the last two columns of Table IV. 

From column E of Table IV, the reduction in errors in the
prediction of total average scores from the subtests rises from
33.86 per cent (using only test 22) to 53.25 per cent when all ten
tests are used. When only five of the ten tests (22, 24, 18, 21,
and 23) are used a predictive efficiency of 51.77 per cent is
obtained. The question then arises as to whether the five poorer
tests are worth the time required for their administration. From
the correlation obtained, it appears that they are of little value
for practical purposes, especially in relationships with total
average score on the New Stanford Achievement Test.

1 Evelyn Garfield, “The Measurement of Motor Ability,” Archives of
Psychology, No. 62, 1922-23.


236 The Journal of Educational Psychology


TABLE IV.—RESULTS OF MULTIPLE RATIO CORRELATION
ANALYSIS OF DATA FROM TABLE III, AND SIGNIFICANCE IN
TERMS OF k AND E

K-A Subtests            Weighted                                 k        E
22                      1.0000    and criterion r of .750      .6614    33.86
Adding 24                .2043    yield multiple r of .810     .5864    41.36
  “    18                .3922      “      “     r of .840     .5426    45.74
  “    23 or 21          .5543      “      “     r of .860     .5103    48.97
  “    21 or 23          .5576      “      “     r of .875     .4823    51.77
  “    [illegible]       .3182      “      “     r of .880     .4750    52.50
  “    [illegible]       .1727      “      “     r of .882     .4712    52.88
  “    [illegible]       .1339      “      “     r of .883     .4694    53.04
  “    [illegible]       .1085      “      “     r of .883     .4694    53.04
  “    [illegible]       .0179      “      “     r of .884     .4675    53.25

From the preceding analysis two generalizations may be 
drawn: (1) A single test (test 22) from the ten subtests at the 
fourth-grade level of the Kuhlmann-Anderson Intelligence Tests 
has as high relationships with New Stanford Achievement Total 
Average Score as the standard derived measures (MA, IQ, and 
Pc. Av.) obtained from all ten tests. (2) By re-weighting the
subtests into an optimally weighted battery using all ten tests, 
the predictive efficiency is increased by 17.05 per cent over using 
MA to predict the same criterion (MA with total average score 
yields an r of .77, which is higher than the r of .74 for IQ, or, the 
r of .70 for the Pc. Av.). The E value for the MA correlation of 
.77 is 36.20 per cent and for the Weighted Battery with a multiple 
correlation of .884, E = 53.25. Using only the five best tests 
increases E from 36.20 per cent for the MA predictor to 51.77 per
cent for the battery of five tests, or an increase of 15.57 per cent 
in predictive efficiency. 


SUMMARY 


From the analysis of the relationships obtained between the 
Kuhlmann-Anderson Intelligence Test and the New Stanford 
Achievement Test administered at the same time (beginning 
fourth grade) the following conclusions are reached:

1) MA appears to be a better predictor of scores on the New
Stanford Achievement Test than IQ or Pc. Av. except for the
prediction of EQ, in which case the IQ and Pc. Av. are better.

2) Arithmetic computation scores are significantly less pre- 
dictable from either MA, IQ, or Pc. Av. than any other Stanford 
Achievement score. 

3) Word meaning and spelling scores are somewhat less pre-
dictable than any other subtest score when prediction is made
from the MA; they are about equally predictable if the IQ or
Pc. Av. is used as the predictive measure.

In the study of the relationships (subtest correlations) between 
the Kuhlmann-Anderson Intelligence Test in Grade 4 and the 
New Stanford Achievement Test for Grade 4, the following 
results appear. 

1) The subtests may be grouped into three distinctive groups. 
The first group consists of tests 17 and 20; the second group of 
tests 15, 16, 18, 19, and 24; and the third group of tests 21, 22, 
and 23. 

2) Test 17 is a border-line test with a correlation falling 
between group two (tests 15, 16, 18, 19, and 24) and group three 
(tests 21, 22, and 23), but is not significantly different from 
either. 

3) Tests 21, 22, and 23, and possibly a fourth, 17, have quite 
high correlations with scores on the three reading measures (.65 
to .70), and the three derived scores; namely, total average, EA 
and EQ with coefficients of correlation from .67 to .76. 

4) Arithmetic computation is not related highly with any of 
the subtests (coefficients of correlation from .17 to .47). 

From the analysis of the correlations and inter-correlations 
between the total average score on the New Stanford Achieve- 
ment Test (the criterion) and the subtests of the Kuhlmann- 
Anderson Intelligence Test (the independent variables) for Grade 4,
the following generalizations are given: 

1) Subtests 21, 22, and 23, are each significantly correlated 
with Stanford Achievement total average score (r = .71, .75, 
and .71, respectively). 

2) A single one of the Kuhlmann-Anderson subtests (tests 21, 
22, or 23) gives just as high relationships with total average 
score on the New Stanford Achievement Test as the MA, IQ, or 
Pc. Av. derived from the total of the ten subtests.

3) Many of the subtests are as highly (or nearly as highly)
correlated with each other as they are with the criterion. Two 
conditions which a good battery of tests should fulfill when used 
for prediction purposes are, that they should have low corre- 
lations with each other, and high correlations with the criterion. 
These conditions are not met in this instance. 

4) Results of the multiple ratio analysis indicate that five tests 
(22, 24, 18, 23, and 21) may most profitably be kept in the 
battery. 

NEED FOR SAFEGUARDING 
THE FIELD OF INTELLIGENCE TESTING 


DOUGLAS E. LAWSON 
Southern Illinois Normal University, Carbondale, Ill. 


When, at the turn of the present century, Binet and Simon 
devised their test for classifying pupils into groups of relative 
brightness, they started something indeed. Today the move- 
ment has reached interesting proportions and already has 
passed through a number of significant phases. 

By the 1920’s various publishers apparently had found tests 
to be lucrative sources of income; numerous educators and 
psychologists were suspecting that there was gold in the hills; and 
‘IQ’ became the watchword of a million teachers (more or less), 
among whom all who had sufficient intelligence themselves to 
work their way out of a revolving-door learned quickly to talk 
glibly about standardized tests, norms, distributions, and 
morons. Thank heaven for the moron! He had given them a 
new lease on self-respect; for, just at the time when teaching was 
losing prestige, a means now had been developed by which the 
moron could be proved to exist, thus giving validity and a 
definitely professional tone to the word. Any teacher with a 
Revised Form A could now talk like a diagnostician. And, 
revolving-doors to the contrary notwithstanding, most of them 
did. 

Probably nobody knows just how many so-called intelligence 
tests have been devised. The Educational Measurements Year- 
book, though following the policy of listing only the better and the 
more recently-published tests, reviews over fifty intelligence tests 
in its 1940 volume. 

Out of the complex movement of intelligence testing a number 
of undeniable evils have grown; and it is the purpose of this paper 
seriously to propose at least a partial remedy. 

To begin with, the administering of an individual test and 
the interpreting of any intelligence test results are tasks which 
require highly specialized skills. For example, in the case of one 
of our outstanding individual intelligence tests, the manual of 
directions for administering and interpreting scores on one form 
comprises sixty pages; while the manual for the other form is 
twice as large. The skilled administering of such a test requires a 
capable understanding of the infinite meanings which a subject 
may give to a question; must assure accurate interpretation of the 
shades of reasoning through which the subject’s mind proceeds in 
stating some of his answers; and requires understanding of the 
statistical significance of the various complex parts of the test. 

Some conscientious educational psychologists who have admin- 
istered group intelligence tests to hundreds or thousands of chil- 
dren will not administer this difficult individual test because they 
feel themselves as yet inadequately trained for a job which 
requires the thorough training and specialized knowledge that it 
presupposes. The expert tester must not only record the answers 
which a subject gives; but in many cases he must know exactly 
why the answers were given. Frequently even the subject 
himself does not know why; does not know by what processes of 
induction, of associational thinking, of deduction, or of trial-and- 
success effort he arrived at a given answer. 

For example, two children are being tested. One of the questions 
may be answered correctly, let us say, by both children. Yet this 
fact does not necessarily mean that both answered it with equal 
intelligence. It does not mean that both arrived at the answer in 
the same way. An analogy might be drawn here from the field of 
medical practice. The doctor who finds two children with 
identical temperatures of, say, one hundred three degrees, does 
not necessarily assume that the diagnosis of one is identical with 
that of the other. He does not use the thermometer as a diag- 
nostic instrument, but merely as a clinical instrument which 
gives him a partial basis for diagnosis. Yet classroom teachers 
have frequently used even the results of the group test for 
diagnostic purposes simply because they have lacked the essential 
skill and training with which to administer and interpret valid 
diagnostic tests. 

To indicate the skill needed in interpreting a relatively simple 
item from an intelligence test, let us take the example of a familiar 
problem formerly much used. The child is shown a diagram of a 
field in the shape of a circle with one opening. The child is to 
imagine that he enters the field and searches for a lost ball. Only 
one point is of paramount importance: he must find the ball with 
the greatest economy of effort, yet he must so proceed as to 
guarantee that he will find it. What kind of a path will he 
describe in his search through the field? 

The correct answer to this problem is a spiral path, beginning 
at the opening and ending in the center of the field. The child 
of low intelligence usually indicates a random pathway, a series of 
criss-cross wanderings, or even a number of broken lines. But to 
mark as correct the spiral path and to discredit all other answers 
is unsafe. A student who had used a series of parallel lines to 
cover the field later explained his choice in this way: ‘‘If I used a 
spiral, the rate of turning would constantly increase. This 
means that I could not actually follow such a path with the auto- 
matic ease necessary to leave my attention free for the search. I 
would have constantly to watch my previous lap of the spiral, for 
one cannot ‘set himself’ to follow more or less unconsciously a 
curved path whose rate of turning is constantly increasing. 
Otherwise I would have taken a spiral course. I thought of it, 
but I felt that the straight lines would leave my mind and eyes 
more free in my search for the lost ball.” 

Now, the inexperienced tester who had given this test to the 
student had marked his straight-line path as being incorrect. 
Technically, the student was wrong; and in most cases such a 
path would indicate less insight on the student’s part than would 
a spiral path. But it seems obvious that the student in this case 
had more insight than even a bright one who indicates a proper 
spiral. He had clearly seen the advantages of the spiral for 
theoretical purposes; but he had seen also its disadvantages as 
they would appear in a practical situation. 

Does this difficulty mean that the student’s answers can never 
be evaluated with mathematical precision? Actually, it means 
that the person who interprets the test must be able to understand 
the reason for the student’s answer, must sometimes estimate the 
degree of correctness in a partially correct answer, and must cer- 
tainly be able to educe the mental correlates of any answer whose 
nature is other than that of the simple factual, right-or-wrong, or 
other less complex type. Surely the competent tester must him- 
self be intelligent if he is to appreciate the subtleties of reasoning by 
which many of his subjects will reach their answers. And he 
must have thorough training in the exact techniques of adminis- 
tering the test. 

It seems necessary here to state the obvious; namely, that an 
expert tester should understand the thing he is trying to test. 
Hence, one who tests intelligence must understand what intelligence 
is. And the writer has seen too many widely divergent 
and sometimes startling definitions of intelligence to suppose that 
even the typical college graduate with a major in education and 
several quarter-hours in psychology has any fundamental under- 
standing of the nature of this complex factor in human behavior. 

Among the actual cases of mal-administering of tests by 
unskilled teachers some few examples may illustrate the need for 
control over the whole field of intelligence testing. For example, 
the writer actually knows a teacher whose child made a low score 
on such a test, was then whipped by the teacher, then re-tested 
with the same instrument. The teacher defended this unique 
procedure by pointing to the fact that the re-test brought a much 
higher score in intelligence. One needs scarcely to mention that 
competent testers do not give their examinations to any child 
who is known to be even slightly disturbed in his emotions if it is 
at all possible to avoid giving the test at such a time. And, 
obviously, to sit down with the child, as this teacher did, pointing 
out his errors on the test before repeating it, is to destroy the 
validity of the results on the re-test, whether or not any emo- 
tional disturbance is present. 

Another case in point is that of a young college student who 
was making good grades in his freshman year. He took an 
intelligence test at the beginning of his sophomore year. Then, 
during all of his second year and most of his junior year, he made 
discouragingly low grades. Finally, when he was placed on 
probation and was brought to the attention of his adviser, he 
expressed the intention of quitting college. He said that his 
intelligence quotient was only 96 and that, consequently, he 
had no hope of success in competition with other college students. 
Asked how he knew about his intelligence score, he said that the 
freshman examiner had told his instructor; the instructor had 
then told the student. The adviser, convinced that the boy 
had a high native intelligence, personally inspected the records 
and found that he stood at the 96th percentile. The nature 
of this gross error on the part of the instructor was then explained 
to the student, who actually made an ‘A’ average during his 
entire senior year. 

A third error that comes to mind was made by a supposedly 
competent child specialist who interpreted an Otis test score in 
terms of a Terman IQ. Since the child ranked in the very 
superior group, the error appears to be approximately twenty-two 
points in IQ. An understanding of the difference between 
the standard deviation of one test and that of the other is neces- 
sary for a valid conversion of scores such as was attempted by 
this examiner. Results of this kind can be very unfortunate, 
resulting in wrong guidance, wrong educational placement, 
wrong advice to parents, wrong vocational planning, and wrong 
curriculum planning for the child. 
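
The kind of conversion this examiner attempted can be sketched as follows. This is an illustration only: the function name and the standard deviations of 12 and 16 are assumed values chosen for the example, not the published constants of either test, so the sketch does not reproduce the twenty-two point error reported above. The score is first expressed in standard-deviation units on the first test and then re-expressed on the scale of the second.

```python
def convert_score(score, mean_from, sd_from, mean_to, sd_to):
    """Re-express a score from one test on the scale of another.

    The score is turned into a standard score (its distance from the
    mean in standard-deviation units) and then mapped onto the second
    test's mean and standard deviation.
    """
    z = (score - mean_from) / sd_from
    return mean_to + z * sd_to


# Hypothetical constants, for illustration only.
score_on_first_test = 140
converted = convert_score(score_on_first_test, 100, 12, 100, 16)

# Error made by reading the first test's score directly as if it
# were a score on the second test.
error_if_read_directly = converted - score_on_first_test
print(round(converted, 1), round(error_if_read_directly, 1))
```

Under these assumed constants the discrepancy grows with the distance of the score from the mean, which is consistent with the error being large for a child in the very superior group.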

More than once have schools been found to use the results of 
a simple group test as the sole authority for ranking of all pupils, 
for homogeneous grouping, for promotion or retention, and for 
other purposes which actually require a full knowledge of each 
student’s achievement, intelligence, health, opportunity, past 
record, effort, attitudes, background, special abilities, and 
interests. 

Good practice and good ethics require that the pupil’s IQ be 
recorded only when it has been verified by careful and competent 
testing and that it be kept as confidential information to be used 
only by those persons who are professionally competent and who 
can use it in guiding the pupil or planning his program. Many 
specialists make a rule of never telling a pupil or his parents 
just what his IQ score is. 

But in one system of over ten thousand children, an intelligence 
test is administered each year. Then each pupil is given a 
printed slip on which his IQ has been recorded. He is asked to 
take this slip home to his parents. One can readily predict the 
results which this indiscriminate publicizing of IQ scores will 
eventually bring. Disgruntled parents, general loss of public 
respect for the tests as well as for the testers, and damaged 
child personalities may be expected. 

An experienced tester may sometimes spend several hours in 
determining a child’s IQ. Yet the writer knows one superintend- 
ent who solves the problem more simply. Each fall he gives an 
achievement test battery, finds the EA by locating the median 
achievement score, divides the EA by the CA, thus getting a 
quotient which he records as IQ and which is used as the sole 
basis for determining each child’s group classification, though not 
his grade classification. In another case the school has a cumula- 
tive record system in which each child’s MA is recorded from a 
single battery of achievement tests of which an alternate form 
has been given the previous fall. For those children who were 
absent at the time of the spring test, the principal instructs the 
teachers impartially to add eight months to their MA’s recorded 
in the fall test. And again, in this case, the principal states 
that the spring test scores are the sole basis used in grouping the 
pupils the following year. 
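
The confusion in the first of these shortcuts can be put in arithmetic form. The sketch below uses the conventional ratio definition of a quotient (an age score divided by chronological age, multiplied by 100). The pupil and all of his ages are invented for illustration; the point is only that a quotient built on EA is an educational quotient and need not agree with one built on MA.

```python
def quotient(age_score_months, chronological_age_months):
    """Ratio quotient: age score over chronological age, times 100."""
    return 100 * age_score_months / chronological_age_months


# Hypothetical pupil, all ages in months.
ca = 120  # chronological age: ten years
ea = 108  # educational age from an achievement battery
ma = 132  # mental age from a competently given intelligence test

eq = quotient(ea, ca)  # what the superintendent actually computes
iq = quotient(ma, ca)  # what he claims to have recorded

print(eq, iq)
```

For this invented pupil the two quotients differ by twenty points, so a record that labels the first figure 'IQ' would place him in the wrong group.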

Let us contrast these methods with those used in the case of a 
pupil who was referred to the Bureau of Child Guidance at 
Southern Illinois Normal University. On the day set for testing 
this boy, three expert consultants from the Illinois Institute for 
Juvenile Research in Chicago had come a distance of three 
hundred miles. The boy’s case was one of only five or six 
scheduled for staffing. Any failure of a case to appear for 
examination would seriously disrupt the program. Yet the 
director declined to let this boy be tested at all because the boy 
was obviously upset emotionally. A conflict in dates had 
occurred; and he would have to miss a school picnic in order to 
take the tests. Knowing that the results of the tests would be 
affected by the boy’s disturbed emotions, the director of the 
Bureau postponed the case for a three-month period, at the end 
of which time another joint staffing was to be held. In the mean- 
time, however, the boy showed symptoms of physical deficiencies 
and abnormal nervous reactions. His case was again postponed 
while he underwent medical treatment and finally a series of 
surgical operations. The members of this Bureau recognized the 
fact that many elements enter into the thing we call intelligence 
and that no valid interpretation of test results could be made 
unless all other variables were controlled. 

All of the foregoing examples are bona fide cases of the use— 
and sometimes the misuse—of standardized tests given to deter- 
mine children’s IQ’s and mental ages. 

A notable result growing from unintelligent testing has been 
the bad repute which standardized intelligence tests sometimes 
have among competent professional people in other fields. The 
writer knows a competent college professor in another field who 
tells his students that such tests are invalid. He illustrates his 
point by saying that he himself took a well-known intelligence 
test at two different times separated by an interval of several 
months. On one of the two occasions, he says, he was a moron; 
on the other, a genius! What he does not know is that the weakness 
lay, not in the test, but in the obvious incompetence with 
which it was administered or interpreted to him. 

In the hands of neophytes and amateurs the standardized test- 
ing movement has become a fad; and the IQ symbol, a shibboleth 
with which to frighten parents, awe the general public, and lend 
a specious but professional-sounding note to a great deal of loose 
conversation in school corridors, in faculty conferences, and at 
parent-teacher meetings. 

There is no doubt concerning the value of intelligence tests 
when used and interpreted by skilled specialists for valid pur- 
poses. The specialists know this; but they are helpless in the 
absence of adequate regulations. Numerous cases give clear 
evidence which seems to warrant the statement that mal- 
administered intelligence tests can result in extreme harm to the 
individual. While the results are never so overt or dramatic, it 
seems conservative to state that the damage may often be fully 
comparable to the damage that can result from malpractice in 
medical prescription or surgery. These latter fields are safe- 
guarded by legislative controls. Yet almost anyone can obtain 
and ‘administer’ an intelligence test with complete immunity 
from any charge of malpractice. 

This term ‘malpractice,’ long significant in the field of medicine, 
is one which should be borrowed by the education profession and 
by psychologists. Quackery is not confined to any specific pro- 
fessional fields. And just as one or two other professions have 
taken effective steps to rid themselves of the incompetent, the 
charlatan, and the ignorant practitioner, so should those profes- 
sions which deal with the infinitely delicate mechanisms of a 
child’s personality, emotions, mentality, education, and social 
adjustment. The challenge to competent educational and 
psychological specialists seems unmistakable. 

It is proposed here that the use of intelligence testing devices 
be restricted to persons who can demonstrate competence both 
in administering and interpreting. A knowledge of the tech- 
niques of test construction, a background in psychology, a super- 
vised interneship in administering and interpreting, and an 
adequate knowledge of the clinical aspects of child study are 
recommended. 

Supervised by expert boards set up by state departments of 
education or welfare, the administration of laws governing the 
use of testing devices should be as easy as the administering of 
laws governing the issuance of licenses for medical practice. 
Simply stated, the need seems to be for legislation which will 
provide for two safeguards: (a) control of the promiscuous dis- 
tribution of tests by their publishers, and (b) requirements for 
the licensing of those who are permitted to administer intelligence 
tests; the licensing to be based upon demonstrated competence 
and essential clinical training and academic preparation. Any- 
thing less will be less than professional. 

A SIMPLE METHOD FOR COMPARING 
THE ACHIEVEMENT OF CLASSES WITH 
THEIR ABILITY 


WILLIAM C. KRATHWOHL 
Illinois Institute of Technology 


High-school or college testing programs, especially those for 
entering freshmen, can, by a method to be described, yield 
information whereby the achievement of a class may be compared 
with its ability. What is more important is that by this same 
method the comparative quality of the class frequently can be 
determined in advance. Such information is of great value to a 
teacher because if his class is poor with respect to other classes, 
he knows he will have to work harder; on the other hand, if his 
class is good he can enrich the subject and give a better course. 
Another application of this method is its use in giving a fair 
comparison of the teaching ability of a group of instructors. 
Frequently the skill of one teacher is compared with another 
merely on the basis of the grades his class achieves at the end of 
a term. Even one college faculty has been compared with 
another solely on the basis of achievement in a nation-wide 
examination. Such a method of comparison is neither fair nor 
scientific since it does not take into account the relative ability 
of the student body to profit by instruction. Its lack of sound- 
ness is evident from the obvious fact that good teachers will 
succeed better with some classes than they will with others. 

The method for making the comparison is as follows: (1) Find 
any test for which test scores have some correlation with indi- 
vidual achievement grades. (2) Find the average score which 
is made by each class on this test and also the average grade 
which the class receives. (3) Find the correlation coefficient 
between average class grades and average test scores. (4) Find 
the line of regression of average class grades against average test 
scores. (5) Make the comparison by noting the position of 
these classes with reference to the line of regression. 
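
The five steps can be sketched in a few lines of code. The class labels, test scores, and grades below are invented for illustration, and the helper names are hypothetical; the arithmetic is the ordinary least-squares line and product-moment correlation, which is assumed here to be what is meant by the usual line of regression.

```python
from math import sqrt
from statistics import mean

# Hypothetical records: class label -> (test score, numeric grade) per student.
classes = {
    "class 1": [(16, 0), (17, 0), (18, 1), (17, 0)],
    "class 2": [(19, 1), (18, 0), (20, 1), (19, 1)],
    "class 3": [(20, 1), (21, 2), (19, 1), (20, 1)],
    "class 4": [(21, 2), (20, 1), (22, 1), (21, 1)],
    "class 5": [(22, 2), (23, 2), (21, 1), (22, 1)],
}

# Step 2: average test score and average grade for each class.
averages = {
    label: (mean(s for s, _ in rows), mean(g for _, g in rows))
    for label, rows in classes.items()
}
xs = [x for x, _ in averages.values()]
ys = [y for _, y in averages.values()]

# Steps 3 and 4: correlation and regression of average grade on average score.
mx, my = mean(xs), mean(ys)
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = my - slope * mx

# Step 5: position of each class relative to the line (positive = above).
residuals = {
    label: y - (intercept + slope * x) for label, (x, y) in averages.items()
}

print(round(r, 3), round(slope, 3), round(intercept, 3))
for label, e in residuals.items():
    print(label, round(e, 3))
```

A class with a positive residual achieved more than its average test score would predict; a class with a negative residual achieved less.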

The advantages of such a scheme are its extreme simplicity, 
and the fact that the diagram can be constructed so easily and 
quickly, with the minimum amount of statistics, that much of 
the work can be done by student help. 

This method is illustrated in the accompanying figure. Each 
dot represents a class. The numbers accompanying the dots are 
the code numbers of the instructors. At the Illinois Institute of 
Technology, where this method is used, these code numbers are 
kept strictly confidential except that any instructor may ascertain 
his own code number. 

The abscissa of each point is the average Derived Score which 
the class made on the A.C.E. Psychological Examination. In 
place of Derived Scores any kind of scores can be used such as 
Scaled Scores, Standard Scores, or Raw Scores. In this particu- 
lar case, the Derived Scores used have a mean of 20, and a 
standard deviation of 4. Thus the average Derived Score on the 
Psychological Examination of the class on the extreme right of 
the diagram, taught by instructor No. 4, is 22.1 or slightly more 
than half a standard deviation above the mean. 

The ordinate of each point is the average grade which the 
members of the class received. In changing from letter grades 
to numbers, A is counted as 3, B as 2, C as 1, D as 0, and E as -1. 
In place of these, other numerical equivalents can be used. Thus 
the average grade of the class just mentioned is about 1.3, which 
means it is 0.3 of the way between a C and a B. 
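
The two coordinates can be checked with a little arithmetic. In the sketch below the grade equivalents and the Derived Score constants (mean 20, standard deviation 4) are those given in the text; the roster of letter grades is invented so that it averages 1.3, and the names are hypothetical.

```python
# Numerical equivalents of the letter grades, as stated above.
GRADE_POINTS = {"A": 3, "B": 2, "C": 1, "D": 0, "E": -1}


def average_grade(letters):
    """Ordinate of a class: the mean of its members' numeric grades."""
    return sum(GRADE_POINTS[g] for g in letters) / len(letters)


def sd_units(score, mean=20, sd=4):
    """Distance of a Derived Score from the mean, in standard deviations."""
    return (score - mean) / sd


# Hypothetical roster: three B's and seven C's.
roster = ["B"] * 3 + ["C"] * 7
ordinate = average_grade(roster)

# Abscissa of the class taught by instructor No. 4, from the text.
standing = sd_units(22.1)

print(ordinate, round(standing, 3))
```

An average of 1.3 lies three tenths of the way from C (1) to B (2), and a Derived Score of 22.1 stands a little over half a standard deviation above the mean.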

The heavy line in the figure running diagonally upward is the 
line of regression, found in the usual manner, of average grades 
against average test scores. 

The two lines running parallel to the line of regression are 
situated from it at a distance of three times the standard error of 
estimate measured parallel to the Y axis. Points outside these 
lines are there for reasons other than errors in sampling. 
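
A minimal sketch of how the two parallel lines might be located is given below. The class averages are invented, the function names are hypothetical, and the standard error of estimate is computed with the number of classes as divisor, which is an assumption (some texts divide by the number of classes less two).

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def standard_error_of_estimate(xs, ys, slope, intercept):
    """Root mean square of the vertical distances from the line."""
    errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sqrt(sum(e * e for e in errors) / len(errors))


def outside_band(x, y, slope, intercept, se, k=3):
    """True if the point lies more than k standard errors from the line,
    the distance being measured parallel to the Y axis."""
    return abs(y - (intercept + slope * x)) > k * se


# Hypothetical class averages used to establish the line.
xs = [17.0, 19.0, 20.0, 21.0, 22.0]
ys = [0.3, 0.7, 1.1, 1.2, 1.5]
slope, intercept = fit_line(xs, ys)
se = standard_error_of_estimate(xs, ys, slope, intercept)

# Two hypothetical classes from a later term, both with X = 20.
print(outside_band(20.0, 1.5, slope, intercept, se))
print(outside_band(20.0, 1.1, slope, intercept, se))
```

In practice the line and its standard error would be established from one set of classes and later classes compared against it, since with only a handful of points no class used in fitting the line can fall far outside a band of this width.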

The correlation coefficient between average class grades in 
social science and average test scores in the Psychological 
Examination is 0.75. This correlation, as might be expected, 
usually runs higher than that between individual grades and test 
scores. In this instance, the latter equals 0.41 and, as required 
by part (1) of this method, indicates some degree of correlation 
between individual grades and scores. 

Several general conclusions can be drawn from a diagram of 
this kind: 

1) There is quite a variation in average mental ability between 
the classes even though no effort was made to section them. 

2) The median class of this group has an average greater than 
20. This happens because grades were awarded at the end of the 
term after some of the very poor students had dropped the course. 

Some conclusions will depend on how grades are awarded. In 
some institutions grades are awarded to the group as a whole and 
are not given independently by the instructor. In others, each 
instructor is a law unto himself and gives his grades without 
reference to those given by his colleagues. Sometimes these two 
methods turn out to be the same where the faculty is composed 
of experienced teachers who determine grades conscientiously and 
are not influenced by external considerations. 

If grades are given to the students by some other agency than 
instructors, then the following conclusions can be drawn: 

1) If the class taught by instructor No. 4 at the extreme right, 
with an average grade of 1.3, is compared with the class taught 
by instructor No. 7 at the extreme left, with an average grade 
of 0.3, the difference in average grades is 1, or a whole grade. If 
no allowance is made for the mental ability of the two classes, 
as is often done, it might be said that No. 4 was a better teacher 
than No. 7. However, when by this method the average mental 
ability of the two classes is compared, there is quite a difference. 
The class taught by No. 7 has an average Derived Score on the 
Psychological Examination of 17.0, whereas the class taught by 
No. 4 has an average Derived Score of about 22.1. A glance at 
the figure shows that both points lie very close to the line of 
regression. Hence, it may be inferred that instructor No. 7, with 
a weaker class, taught as well as No. 4, with his much brighter 
class. 

2) Since the correlation coefficient between average grades 
and average test scores is 0.75, which is unusually high for 
investigations of this sort, the A.C.E. Psychological Examination 
predicts very well the average grade of a class in social science in 
this college from a knowledge of its average mental ability. 

It is reasonable to assume that for a given institution the line 
of regression will not vary greatly from year to year. Hence, 
if this line has been established before instructor No. 4 meets 
his class, he knows he can enrich his course because of the 
superior mental ability of his students. On the other hand, 
No. 7 knows he will have to work hard to get across the bare 
essentials. 

If instead of classes these two dots represent two different 
colleges, it can be said that the instruction given at No. 7 was as 
good as that given at No. 4, in spite of the fact that the 
achievement of the two institutions differs by a whole 
grade. 

3) One of the classes taught by No. 5 on X = 20 is farther 
above the line of regression than any of the other classes, and 
close to the upper parallel line. Its position may be due to some 
unusual factor, such as the presence of pupils who were unusually 
industrious or because it was taught at a favorable time of the 
day. The lowest class of No. 4 might have been adversely 
affected by just the opposite conditions. Such information is 
valuable to officers of administration. 

If the grades are awarded by instructors independently of their 
colleagues, the conclusions given above still may be true. Other 
conclusions are: 

1) No instructor in this diagram marks on a ‘curve,’ that is, 
divides up his grades on a percentage basis. If he did, all of his 
classes would have the same average and would be on a line 
parallel to the X axis. 
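
Why marking on a curve flattens an instructor's classes onto a horizontal line can be shown with a small example. The percentages and the raw marks below are invented, and the function name is hypothetical; the grade equivalents are those used earlier in the article.

```python
GRADE_POINTS = {"A": 3, "B": 2, "C": 1, "D": 0, "E": -1}

# Hypothetical curve: per cent of the class receiving each letter.
CURVE = [("A", 10), ("B", 20), ("C", 40), ("D", 20), ("E", 10)]


def grade_on_curve(marks, curve=CURVE):
    """Assign letters by rank alone, in fixed percentages."""
    n = len(marks)
    ranked = sorted(range(n), key=lambda i: marks[i], reverse=True)
    letters = [None] * n
    start, cumulative = 0, 0
    for letter, per_cent in curve:
        cumulative += per_cent
        end = cumulative * n // 100
        for i in ranked[start:end]:
            letters[i] = letter
        start = end
    return letters


def average_grade(letters):
    return sum(GRADE_POINTS[g] for g in letters) / len(letters)


# Two hypothetical classes of very different quality (raw marks).
weak_class = [35, 40, 42, 45, 50, 52, 55, 58, 60, 65]
strong_class = [60, 66, 70, 74, 78, 81, 85, 88, 92, 97]

weak_average = average_grade(grade_on_curve(weak_class))
strong_average = average_grade(grade_on_curve(strong_class))
print(weak_average, strong_average)
```

Because the letters depend only on rank within the class, both classes receive the same average grade whatever their average test scores may be.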

2) A class might lie above the line of regression because of 
good teaching or because of low standards, whereas a class might 
lie below the line because of poor teaching or unusually high 
standards. However, a series of diagrams of this kind, drawn 
over a period of years, frequently will give some evidence about 
either teaching competence or strictness in grading. Teaching 
effectiveness, as will be noticed, is much more difficult to evaluate. 

3) The subject material which a teacher emphasizes in his 
course will sometimes be shown by diagrams comparing average 
class grades with other tests. For instance, if an instructor uses 
any consistency at all in grading, and if he emphasizes the 
mathematical aspects of social science rather than general 
information, and if the comparison test used is one involving 
mathematical aptitude or mathematical training, then his classes 
will tend to lie on a straight line on the figures using those tests. 


SUMMARY 


1) It is possible with very little labor and a minimum of 
statistics to construct a graph which will compare the achieve- 
ment of a class with its ability. 

2) Once such a graph is constructed, a teacher can know in 
advance the quality of the class he is expected to teach and can 
plan accordingly. 

3) It frequently is possible to make a just and fair estimate of 
the teaching ability of an instructor independently of the quality 
of the class which is assigned to him. 

4) It is possible to evaluate more fairly, on nation-wide 
examinations, the teaching efficiency of the faculty of one college 
in comparison with that of another college. 

FREDERICK T. HOWARD. Complexity of Mental Processes in 
Science Testing. New York: Bureau of Publications, 
Teachers College, Columbia University, 1943, pp. 54. 


The relationship between the ability to recall specific infor- 
mation and the ability to perform more complex mental tasks 
was investigated in this study. Items from the Coöperative 
General Science Test for college students were used. ‘Expert’ 
judges rated the items for complexity. Reliability of combined 
rating by six judges was .92. When rated by good students this 
reliability dropped to .82 and by poor students, to .74. Thus, 
persons well acquainted with the test material can reliably rate 
the items for complexity. For the expert judges, item difficulty 
did not influence judgments of complexity. Five subtests, 
varying in item complexity, were organized. Approximately 
one thousand students from representative colleges and uni- 
versities served as subjects. 

The subtests, which varied in level of item complexity, were 
equivalent as measures of science achievement and apparently of 
mental ability also. Four factors account for approximately 
ninety-one per cent of the variance. Whatever makes for 
science achievement constitutes the main factor. Item com- 
plexity accounts for only a small per cent of the variance. 

Among the implications of the findings are the following: (1) 
Contrary to previous views, the more complex items in a subject- 
matter test are neither more nor less valid than other items as 
measures of mental ability. (2) The ability to organize (use 
facts, make applications) in a field appears no different from 
extent of information (recalling specific facts) in that field. 
(3) This suggests that if the student is given the opportunity to 
acquire information, concepts and understanding, he can and 
probably will use them. 

The author seems well aware of the limitations of the study: 
(1) He was not concerned with measuring retention at the end of 
a specific course where intensive review presumably occurs. 
(2) There were some limitations in range of complexity of the 
items employed. (3) The results do not imply that instructional 
methods are unimportant. Nevertheless, if one is concerned 
with retention and use of information some time after exposure 
to the material, the findings of this investigation are highly 
significant. They should help to correct certain misconceptions 
that have arisen from previous work. 
MILES A. TINKER 
University of Minnesota 


R. NEVITT SANFORD, MARGARET M. ADKINS, BRETNEY R. 
MILLER, AND ELIZABETH A. COBB. Physique, Personality, 
and Scholarship: a Coöperative Study of School Children. 
Washington, D.C.: National Research Council Society for 
Research in Child Development, 1943, pp. 705. 


In recent years psychologists have become interested in the 
investigation of the organism as a whole, that is, in the study and 
understanding of the complete interplay and organization of the 
many factors determining behavior. This movement has been 
given impetus by several publications which indicate that 
German military psychologists are making practical use of 
psychological theories about the whole personality. 

This monograph is a reflection of the trend. It presents the 
results of a three-year exploratory study of physiological, person- 
ality and environmental factors. In the three-year period, 
forty-eight children ranging in age from five to fourteen years 
were studied intensively. Repeated testing of forty-three of 
these subjects permitted longitudinal studies of behavior as well 
as the usual cross-sectional studies. 

The few subjects that were studied and the authors’ dubious 
use of correlation coefficients based on the grouping of subjects 
in the extremes of the distributions tend to vitiate the findings of 
this investigation. However, the authors are aware of the 
homogeneous character of their small population and are careful 
in limiting the nature of the generalizations drawn from their 
data. 

As a pioneer publication in the study of the whole child this 
monograph should serve to guide and crystallize the thoughts of 
the increasing number of investigators who are beginning to 
think in terms of the whole personality. Moreover, it contrib- 
utes to the methodology of experimentation with the whole 
personality by introducing a technique, a so-called syndrome 
technique, that the authors have found to be of practical use. 

The authors are courageous in attempting a task as huge as 
the one they set for themselves. Although they failed to find 
many of the answers they were seeking, they have made a start 
along a path that will probably become increasingly well used, 
and knowledge of their studies and results should aid future 
investigators in eliminating blind alleys before proceeding with 
their experiments. DAVID V. TIEDEMAN 
College Entrance Examination Board 


J. W. WRIGHTSTONE AND E. A. NIFENECKER. Determining 
Readiness for Reading. New York: Educ. Res. Bull. No. 6 
of the Bureau of Reference, Research and Statistics, Board 
of Education of the City of New York, 1943, pp. 49. 


This treatise was planned as a guide to teachers and supervisors 
in the selection and use of records and tests for determining 
reading readiness of first-year pupils. Consideration is given 
to growth in readiness to read, a program for determining reading 
readiness, formal and informal methods of studying beginning 
pupils, techniques to aid interpretation of data obtained, and the 
use of results in school practice. The authors have selected ade- 
quately the pertinent material from the mass of data available 
and presented a well-organized and clear discussion. Through- 
out there is emphasis upon practical classroom situations. 

The section on interpretation (“Putting together the facts 
about reading readiness”), however, might have been expanded to 
advantage, since the ordinary teacher is apt to be more inade- 
quate in this aspect of the reading readiness program. Helpful 
additions might include: (1) more stress on interpretation of 
irregular patterns of growth, (2) the use of profile charts of 
abilities, (3) the inclusion of a few case studies. 

In some instances the discussion seems too brief, i.e., too much 
like an outline, to be most effective. It seems to the reviewer 
that undue stress is given the rôle of motor development. On 
page 41, where development of ‘visual acuity’ is mentioned, 
undoubtedly visual discrimination is meant since visual acuity 
is not susceptible to training. The bibliography of tests and 
scales would become more useful if material on reliability and 
validity were included. 

This bulletin, like earlier ones in the series, will be welcomed 
by teachers of reading. MILES A. TINKER 
University of Minnesota 














